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ABSTRACT 

Given  a  set  of  possibly  corrupted  and  incomplete  linear  measurements,  we  leverage  low-dimensional  models  to  best 
explain  the  data  for  provable  solution  quality  in  inversion.  A  non-exhaustive  list  of  examples  includes  sparse  vector 
and  low-rank  matrix  approximation.  Most  of  the  well-known  low  dimensional  models  are  inherently  non-convex. 
However,  recent  approaches  prefer  convex  surrogates  that  “relax”  the  problem  in  order  to  establish  solution 
uniqueness  and  stability.  In  this  paper,  we  tackle  the  linear  inverse  problems  revolving  around  low-rank  matrices  by 
preserving  their  non-convex  structure.  To  this  end,  we  present  and  analyze  a  new  set  of  sparse  and  low-rank  recovery 
algorithms  within  the  class  of  hard  thresholding  methods.  We  provide  strategies  on  how  to  set  up  these  algorithms  via 
basic  “ingredients”  for  different  configurations  to  achieve  complexity  vs.  accuracy  tradeoffs.  Moreover,  we  propose 
acceleration  schemes  by  utilizing  memory-based  techniques  and  randomized,  ?-approximate,  low-rank  projections  to 
speed-up  the  convergence  as  well  as  decrease  the  computational  costs  in  the  recovery  process.  For  all  these  cases,  we 
present  theoretical  analysis  that  guarantees  convergence  under  mild  problem  conditions.  Simulation  results 
demonstrate  notable  performance  improvements  compared  to  state-of-the-art  algorithms  both  in  terms  of  data 
reconstruction  and  computational  complexity. 
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Matrix  Recipes  for 
Hard  Thresholding  Methods 

Anastasios  Kyrillidis  and  Volkan  Cevher 


Abstract 

Given  a  set  of  possibly  corrupted  and  incomplete  linear  measurements,  we  leverage  low-dimensional  models  to  best 
explain  the  data  for  provable  solution  quality  in  inversion.  A  non-exhaustive  list  of  examples  includes  sparse  vector  and 
low-rank  matrix  approximation.  Most  of  the  well-known  low  dimensional  models  are  inherently  non-convex.  However,  recent 
approaches  prefer  convex  surrogates  that  “relax”  the  problem  in  order  to  establish  solution  uniqueness  and  stability.  In  this 
paper,  we  tackle  the  linear  inverse  problems  revolving  around  low-rank  matrices  by  preserving  their  non-convex  structure.  To 
this  end,  we  present  and  analyze  a  new  set  of  sparse  and  low-rank  recovery  algorithms  within  the  class  of  hard  thresholding 
methods.  We  provide  strategies  on  how  to  set  up  these  algorithms  via  basic  "ingredients”  for  different  configurations  to 
achieve  complexity  vs.  accuracy  tradeoffs.  Moreover,  we  propose  acceleration  schemes  by  utilizing  memory-based  techniques 
and  randomized,  e-approximate,  low-rank  projections  to  speed-up  the  convergence  as  well  as  decrease  the  computational  costs 
in  the  recovery  process.  For  all  these  cases,  we  present  theoretical  analysis  that  guarantees  convergence  under  mild  problem 
conditions.  Simulation  results  demonstrate  notable  performance  improvements  compared  to  state-of-the-art  algorithms  both 
in  terms  of  data  reconstruction  and  computational  complexity. 


Index  Terms 

Affine  rank  minimization,  compressed  sensing,  sparse  approximation  algorithms,  hard  thresholding,  greedy  methods,  e-approximation 
schemes,  randomized  algorithms. 


I.  Introduction 

A.  Background 

Compressed  Sensing  (CS)  theory  [1]  lies  at  the  heart  of  exciting  developments  in  the  areas  of  signal  processing  and 
convex/discrete  optimization.  CS  roughly  states  that  a  sparse  vector  signal  can  be  perfectly  reconstructed  from  far  fewer  samples, 
compared  to  its  ambient  dimension,  as  dictated  by  the  well-known  Nyquist-Shannon  theorem.  To  accomplish  this,  efficient  sparse 
approximation  algorithms  have  been  proposed,  accompanied  with  strong  theoretical  convergence  and  approximation  guarantees 
[2]. 

CS  theory  is  based  on  the  sparse  transform  coding  technique.  To  describe  the  main  idea,  let  us  assume  that  x *  £  lZn  is  a 
dense  n-dimensional  vector  of  interest.  Instead  of  processing  x*  in  its  dense  representation,  the  literature  today  offers  signal 
basis  transforms  that  promote  data  compression  and  sparsity.  Thus,  using  the  appropriate  orthonormal  basis  matrix  \k  €  TZnXn, 
x *  can  be  described  as  a  s-sparse  (s  <C  n )  linear  combination  of  atoms  {ipi}™=1  corresponding  to  the  columns  of  \k.  This 
representation  can  be  either  exact,  x*  =  vka,  or  approximate,  x*  «  Vka:,  where  a  £  lZn  denotes  the  set  of  coefficients  with 
only  s  out  of  n  entries  being  nonzero.  Without  loss  of  generality,  we  assume  is  the  identity  matrix,  unless  otherwise  stated. 

Based  on  this  key  observation,  CS  proposes  an  alternative  sampling  theorem  where  the  samples  are  acquired  through  linear 
projections  of  the  signal  of  interest  using  a  (random)  measurement  matrix  and  the  number  of  measurements  needed  for  signal 
recovery  is  much  less  compared  to  traditional  sampling  techniques. 


B.  CS  problem  statement 

In  the  standard  CS  framework,  we  seek  to  reconstruct  a  high-dimensional  signal  x*  £  lZn  through  a  low-dimensional 
observation  vector  y  £  1Z"1  ( m  <  n )  such  that: 

y  =  t&arf  +  e.  (1) 

In  this  setting,  d>  €  7?.mxn  represen(S  the  measurement/sensing  matrix  and  e  £  lZm  is  an  additive  noise  term. 

To  recover  x*  given  y  and  <I>,  unconstrained  least-squares  method  is  the  classic  approach  to  the  solution  of  linear  systems  by 
minimizing  the  data  error  function  g{x)  :=  \\y—  3>a;|||  where  ||  •  ||g  denotes  the  Cj-norm.  Nevertheless,  the  reconstruction  of  x * 
from  y  is  an  ill-posed  problem  and  there  is  no  hope  in  finding  the  true  vector  without  ambiguity;  there  is  an  infinite  number  of 
possible  solutions  that  satisfy  the  linear  system  of  equations.  Therefore,  additional  signal  prior  knowledge  should  be  exploited  in 
the  optimization  solver.  Using  the  fact  that  x*  is  s-sparse,  we  concentrate  on  the  following  constrained  minimization  problem: 

minimize  g{x)  subject  to  ||at||o  <  s,  (2) 

where  ||tc||o  is  the  “norm”  that  counts  the  non-zeros  in  x. 

CS  theory  plays  an  important  role  in  solving  (2):  assuming  signal  sparsity,  the  true  solution  x*  can  be  found  using  m  <C  n 
measurements,  as  long  as  the  geometry  of  sparse  signals  is  preserved  after  projection  on  the  subspace  defined  by  <f>.  To  achieve 
this,  CS  community  concentrates  on  developing  polynomial-time  algorithms  for  sparse  signal  recovery  from  a  limited  number 
of  non-adaptive  samples.  Although  the  collection  of  works  in  this  direction  grows  fast,  the  problem  of  constructing  efficient 
methods  both  in  execution  time  and  signal  recovery  performance  remains  widely  open  [2],  [3], 
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C.  From  compressed  sensing  to  rank  minimization 

In  the  general  affine  rank  minimization  (ARM)  problem,  the  set  of  observations  y  £  1ZP  is  acquired  as  y  =  AX *  +e  where 
A  :  /JZmXn  — >  7 Zp  and  X*  £  jirnXn  )s  the  rank-fc  matrix  (k  <C  min{m,  n})  that  we  desire  to  recover.  As  in  CS,  the  challenge 
is  to  reconstruct  the  true  low-rank  matrix  given  that  p  <C  m  x  n.  For  this  purpose,  we  are  interested  in  finding  the  simplest 
solution  X  of  minimum  rank  that  minimizes  the  data  error  f(X)  :=  \\y  —  AX |||: 


minimize  f(X)  subject  to  rank(X)  <  k. 

XCTZmXn 


(3) 


The  ARM  problem  appears  in  many  applications;  low-dimensional  embedding  [3],  matrix  completion  [4],  image  compression 
[5],  function  learning  [6],  [7]  just  to  name  a  few. 

In  (2),  the  minimization  problem  can  be  equivalently  rewritten  as: 

minimize  f(X)  subject  to  rank(X)  <  k,  (4) 


where  Cn  denotes  the  set  of  square  nxn  diagonal  matrices  and  A  :  /TZnXn  — y  7 Zm  is  a  generic  linear  operator  such  that,  given 
$  £  'JZmxn  in  (1),  the  solutions  x  and  X  of  (2)  and  (4),  respectively,  satisfy  AX  =  for  X  £  £"  with  x  £  7Zn  on  the 
main  diagonal.  In  other  words,  the  problem  in  (2)  is  a  special  case  of  (3). 

We  present  below  some  characteristic  examples  for  the  linear  operator  A: 

Matrix  Completion  (MC):  As  a  motivating  example,  consider  the  famous  Netflix  problem  [8],  a  recommender  system  problem 
where  user  movie  preferences  are  inferred  by  a  limited  subset  of  entries  in  a  database.  Matrix  completion  is  an  ill-posed  problem 
since  the  number  of  variables  exceeds  the  number  of  observations  and,  furthermore,  there  are  entries  for  which  no  knowledge 
is  available. 

To  set-up  the  optimization  problem,  let  Q  =  {(i,j)  :  [X*]ij  is  known}  C  {1, . . .  ,m}  x  {1, . . .  ,n}  be  the  set  of  ordered 
pairs  that  represent  the  coordinates  of  the  observable  entries — here,  [X*]ij  represents  the  element  at  the  i-th  row  and  j-th 
column  of  the  matrix  X*.  The  matrix  completion  problem  can  be  solved  as: 


minimize  II  AnX  —  AnX*  II  subject  to  rank(X)  <  k. 

X6R”*"  '''  V  ' 

where  An  defines  a  linear  mask  over  the  observable  entries  S2.  We  observe  that  (5)  is  a  special  case  of  the  low-rank  ARM 
problem. 

Principal  Component  Analysis  (PCA)  and  Robust  PCA:  In  the  PCA  problem,  we  are  interested  in  reconstructing  X*  from 
the  noisy  set  of  observations  y  =  AX*  +  e  where  A  :  TZmxn  — >  TZP  is  an  identity  linear  map  with  p  =  m  x  n.  Furthermore, 
combining  ideas  from  sparse  and  low-rank  signal  recovery  methods.  Robust  PCA  (RPCA)  problem  deals  with  the  challenge  of 
recovering  a  low  rank  L*  and  a  sparse  component  M*  from  a  complete  data  matrix  Y  such  that  Y  =  L*  +  M*  +e.  We  extend 
the  proposed  framework  for  this  case  in  [9]. 

General  linear  maps:  We  can  consider  many  other  cases  where  A  is  any  arbitrary  linear  map.  As  an  example,  in  the 
experiments  we  consider  the  case  where  A  is  constituted  by  permuted  noiselets  [10],  Recent  developments  further  indicate 
connections  of  ridge  function  learning  with  the  ARM  problem  with  general  linear  operators  A  [6], 


D.  Two  camps  of  recovery  algorithms 

Due  to  the  non-convexity  of  ||  •  ||()  and  rank(-)  constraints,  there  are  no  known  polynomial-time  algorithms  that  solve  (2), 
(3)  with  guaranteed  optimality,  without  further  assumptions  on  the  measurement  operator.  As  a  consequence,  various  convex 
relaxations  have  been  proposed  to  approximate  the  solution.  In  [11],  Donoho  et  al.  demonstrate  that,  in  the  sparse  approximation 
problem,  under  basic  incoherence  properties  of  the  sensing  matrix  $  and  given  x*  is  sufficiently  sparse,  the  combinatorial 
“norm”  ||  ■  ||o  in  (2)  can  be  substituted  by  its  sparsity-inducing  convex  surrogate  ||  •  ||i  with  provable  guarantees  for  unique 
signal  recovery.  In  the  ARM  problem,  Fazel  et  al.  [12]  identified  the  nuclear  norm  ||-X}|*  :=  cr;  as  a  convex  surrogate 

of  rank(Y)  operator  where  we  can  leverage  second-order  optimization  approaches,  such  as  interior-point  methods — here,  Oi 
denotes  the  i-th  singular  value  of  X.  Under  basic  incoherence  properties  of  the  sensing  matrix  A,  [12]  provides  provable 
guarantees  for  unique  signal  recovery. 

Here,  we  restrict  our  attention  to  low-rank  minimization  problem  formulations,  alternative  to  (3).  Given  the  above,  the  equality- 
constrained  nuclear-norm  minimization  problem: 

minimize  ||-Xjl*  subject  to  y  =  AX, 

X6Rm*"  v  ' 

for  the  noiseless  case  and  the  regularized  optimization  problem  in  the  presence  of  noise: 

minimize  -f(X)  +  r|jX||*,  (7) 

xeR”*"  2 


emerge  as  natural  estimators  of  X* — in  (7),  r  >  0  balances  the  error  norm  and  the  rank  of  the  solution.  Based  on  the  LASSO 
operator  for  the  vector  case  [13],  (3)  can  also  be  relaxed  to: 


minimize  f(X )  subject  to  ||X|L  <  A, 

XER"*" 


(8) 


where  A  >  0  is  a  regularization  parameter  that  governs  the  rank  of  the  solution. 


3 


Once  (2),  (3)  are  relaxed  to  a  convex  problem,  decades  of  knowledge  on  convex  analysis  and  optimization  can  be  leveraged. 
Interior  point  methods  find  a  solution  with  fixed  precision  in  polynomial  time  but  their  complexity  might  be  prohibitive  even  for 
moderate-sized  problems  [14],  [15],  More  suitable  for  large-scale  data  analysis,  first-order  methods  constitute  low-complexity 
alternatives  [16]— [19], 

In  contrast  to  the  conventional  convex  relaxation  approaches,  iterative  greedy  algorithms  maintain  the  combinatorial  nature 
of  (2),  (3).  Unfortunately,  solving  (2),  (3)  optimally  is  in  general  NP-hard  [20].  Due  to  this  computational  intractability,  the 
algorithms  in  this  class  greedily  refine  a  s-sparse/rank-fc  solution  using  only  “local”  information  available  at  the  current  iteration 
[21],  [22], 

E.  Contributions. 

In  this  work,  we  mainly  focus  on  the  ARM  problem,  unless  otherwise  stated,  and  exploit  a  special  class  of  iterative  algorithms 
known  as  the  hard  thresholding  methods.  Similar  results  can  be  derived  for  the  vector  case  in  a  straightforward  way  [23].  Note 
that  the  transition  from  sparse  vector  approximation  to  ARM  is  non-trivial;  while  s-sparse  signals  “live”  in  the  union  of  finite 
number  of  subspaces,  the  set  of  rank-fc  matrices  expand  to  infinitely  many  subspaces.  Thus,  the  selection  rules  do  not  generalize 
in  a  straightforward  way. 

To  this  end,  we  propose  and  analyze  acceleration  schemes  for  this  class  of  algorithms  with  applications  to  low  rank  matrix 
approximation  in  linear  inverse  systems.  Our  contributions  are  the  following: 

“Ingredients”  of  hard  thresholding  methods:  We  analyze  the  behaviour  and  performance  of  hard  thresholding  methods  from 
a  global  perspective.  Three  basic  building  blocks  (“ingredients”)  are  studied:  i)  step  size  selection  /x;,  ii)  memory  exploitation, 
and  Hi)  gradient  or  least-squares  updates  over  restricted  low-rank  subspaces  (e.g.,  adaptive  block-coordinate  descent).  We 
highlight  the  impact  of  these  key  pieces  on  the  convergence  rate  and  signal  reconstruction  performance  and  provide  optimal 
and/or  efficient  strategies  on  how  to  set  up  these  “ingredients”  under  different  problem  conditions. 

Low-rank  matrix  approximations  in  hard  thresholding  methods:  In  [24],  the  authors  indicate  the  connection  of  combina¬ 
torial  model  projections  in  CS  with  the  modular  set  function  optimization  problem.  [24]  shows  that  the  solution  efficiency  can 
be  significantly  improved  by  e-approximation  algorithms.  Based  on  this  new  algorithmic  definition,  we  analyze  the  impact  of 
e-approximation  low  rank-revealing  schemes  in  the  proposed  algorithms  with  well-characterized  time  and  space  complexities. 
Moreover,  we  provide  extensive  analysis  to  prove  convergence  using  approximate  low-rank  projections. 

Hard  thresholding-based  framework  with  improved  convergence  conditions:  We  study  four  hard  thresholding  variants 
that  provide  salient  computational  trade-offs  for  the  class  of  greedy  methods  for  low-rank  matrix  recovery.  These  methods,  as 
they  iterate,  optimally  exploit  the  non-convex  scaffold  of  low  rank  matrices  on  which  the  approximation  problem  resides.  Using 
simple  analysis  tools,  we  derive  improved  conditions  that  guarantee  convergence,  compared  to  state-of-the-art  approaches. 

The  organization  of  the  paper  is  as  follows.  In  Section  II,  we  set  up  the  notation  and  provide  some  definitions  and  properties, 
essential  for  the  rest  of  the  paper.  In  Section  III,  we  describe  the  basic  algorithmic  framworks  in  a  nutshell,  on  which  the  rest 
of  the  work  elaborates,  while  in  Section  IV  we  provide  important  “ingredients”  for  the  class  of  hard-thresholding  methods — 
detailed  convergence  analysis  proofs  are  provided  in  Section  V.  The  complexity  analysis  of  the  proposed  algorithms  is  provided 
in  Section  VI.  We  study  two  acceleration  schemes  in  Sections  VII  and  VIII,  based  on  memory  utilization  and  e-approximate 
low-rank  projections,  respectively.  We  further  improve  convergence  speed  by  exploiting  randomized  SVD  low  rank  projections 
based  on  power  iteration-based  subspace  finder  tools  [25].  We  provide  empirical  support  for  our  claims  for  better  data  recovery 
performance  and  reduced  complexity  through  experimental  results  on  synthetic  data  in  Section  X.  Finally,  we  conclude  by 
providing  our  perspective  on  the  future  directions  in  Section  XI. 

II.  Elementary  Definitions  and  Properties 

Notation.  We  reserve  lower-case  and  bold  lower-case  letters  for  scalar  and  vector  variable  representation,  respectively.  Bold 
upper-case  letters  denote  matrices  while  bold  calligraphic  upper-case  letters  represent  linear  operators.  We  reserve  calligraphic 
upper-case  letters  for  set  representations.  For  a  matrix  X  £  7\lm><n  an(j  a  sej  Gf  indices  I  £  {1,2, . . .  ,  n},  the  matrix  Xi 
denotes  the  submatrix  of  X  with  columns  indexed  by  the  set  I.  We  use  X(i)  £  7?mxrl  to  represent  the  current  matrix  estimate 
at  the  i-th  iteration. 

The  rank  of  X  is  denoted  as  rank(X)  <  min{m,  n}.  The  empirical  data  error  is  denoted  as  f(X)  :=  \\y  —  A-VHl  with 
gradient  V/(X)  :=  —2 A*(y  —  AX ),  where  *  is  the  adjoint  operation  over  the  linear  operator  A.  The  inner  product  between 
matrices  A,  B  £  ']Zrnxn  is  denoted  as  (A,  B)  =  trac e(BT  A),  where  T  represents  the  transpose  operation.  /  represents  an 
identity  matrix  with  dimensions  apparent  from  the  context. 

Let  T  denote  the  set  of  non-collinear,  non-zero  rank-1  matrices  in  'JZrnxn — this  set  defines  an  uncountably  infinite  number 
of  matrices  in  a  finite  dimensional  space.  Furthermore,  let  U  C  T  denote  the  superset  that  includes  all  the  sets  of  orthonormal, 
rank-1  matrices  that  span  a  subspace  included  in  the  subspace  induced  by  T — i.e.,  for  all  sets  of  orthonormal,  rank-1  matrices 
S  £U,  the  following  hold  true:  i)  span(<S)  C  span(T)  where  span(-)  denotes  the  subspace  spanned  by  the  input  set  of  rank-1 
matrices,  ii)  (Si,  Sj)  =  0,  \/Si ,  Sj  £  S,  i  ^  j,  and  in)  rank(S'i)  =  rankjSy)  =  1.  With  slight  abuse  of  notation,  given  a 
set  of  rank-1  matrices  5,  rank(span(5))  calculates  the  maximum  rank  of  matrices  that  lie  on  the  subspace  spanned  by  the  set 
S.  Given  a  finite  set  S  £U,  |S|  denotes  the  cardinality  of  S.  For  any  matrix  X.  we  use  R{X)  and  N(X)  to  denote  its  range 
and  nullspace,  respectively. 
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A.  Matrix  decompositions  and  subspace  projections 

Given  a  finite  set  S  £  M,  we  declare  that  S  spans  a  subspace  of  jZmxrL  if  and  only  if  any  matrix  X  £  X’nxri  that  lies  in 
this  subspace  can  be  decomposed  as  a  linear  combination  of  rank-1  matrices  from  S  [26] — i.e., 

rank  (A-) 

3 a;  £  TZ+,  i  —  1,. . .  ,rank(X)  such  that  X  =  otiSi,  St  £  S.  (9) 

i= 1 

Moreover,  we  denote  a  set  of  orthonormal,  rank-1  matrices  that  span  the  subspace  induced  by  X  as  ortho(X).  ortho(X)  is 
obtained  as  a  result  of  the  following  optimization  problem: 

ortho(X)  :=  argmin{|iJ|  :  Z  C.U,X  £  span  (Z),  ( Zt ,  Zf)  =  0,  i  ^  j,  WZi,  Z j  £  Z}.  (10) 

z 

Remark  1.  For  a  matrix  X  £  TZmXn,  ortho(X)  is  not  unique — there  is  an  infinite  uncountable  number  of  sets  of  orthonormal, 
rank-1  matrices  that  span  a  given  subspace. 

Given  a  set  5  C  U,  we  denote  the  orthogonal  projection  operator  onto  the  subspace  induced  by  S  as  Vs\  furthermore, 
we  denote  the  orthogonal  projection  operator  onto  the  orthogonal  subspace  of  S  as  Vs-  Thus,  for  arbitrary  Y  £  7ZmXn, 
VsY  £  'JZrnXn  denotes  the  orthogonal  projection  of  Y  onto  the  subspace  spanned  by  S  and  V$  V  £  jZmxn  denotes  the 
orthogonal  projection  of  Y  onto  the  orthogonal  subspace  defined  by  S. 

Remark  2.  Given  S  C  Zf ,  Vs  is  an  idempotent  linear  transformation,  that  is,  \/X  £  lZrnxn,  VsVsX  =  VsX. 

For  an  arbitrary  set  5  C  if,  we  can  always  decompose  a  matrix  X  £  ]Zmxn  into  two  matrix  components,  as  follows: 

X  :=  VsX+VsX,  such  that  {VsX,VgX)  =  0. 

Remark  3.  Let  X  £  jZrnXn  and  S  be  a  set  of  rank- 1  matrices  such  that  X  £  span(S)  and  rank(span(Sj)  >  rank(X).  Then, 
VsX  -  X — i.e.,  since  X  £  span(S),  the  best  projection  of  X  onto  the  subspace  induced  by  S  is  the  matrix  X  itself. 

Given  two  sets  Si,S2  C  IA.  the  following  holds  true  for  any  matrix  X  £  lZmxn\ 

VSlVs2X  =  Vs2VSlX.  (11) 

With  slight  abuse  of  notation,  we  replace  the  series  of  projection  operations  VsfVs2X  by  one  projection  as  follows:  Vs1\s2X — 
similarly,  Vft  \S2  denotes  the  orthogonal  projection  onto  the  orthogonal  subspace,  spanned  by  <Si  \  <S>2. 


B.  Singular  Value  Decomposition  (SVD)  and  its  properties 

Definition  1.  [SVD]  Let  X  £  ]ZmXn  be  a  rank-k  (k  <  min {m,n})  matrix.  Then,  the  SVD  of  X  is  given  by: 


X  =  uyvt 


[Ua 


Up] 


£  0 

vr 

0  0 

m\ 

(12) 


where  Ua  £  TZmxk,Up  £  nmx(rn-k)  ,V  a  £  TZnXk,Vp  ejlnx(n~k)  and  S  =  diag((n, . . .  ,ak)  £  Llkxk  for  o1, ...  ,ok  £ 
1Z+.  Here,  the  columns  of  U,  V  represent  the  set  of  left  and  right  singular  vectors,  respectively,  and  ai, . . . ,  ak  denote  the 
singular  values. 


Remark  4.  The  SVD  of  a  matrix  X  is  an  orthonormal  decomposition  of  X. 

For  any  matrix  X  £  TVnXn  with  arbitrary  rank(X)  <  min{m, n},  the  best  orthogonal  projection  Vk(X)  onto  the  set  of 
rank-fc  ( k  <  rank(X))  matrices  Ck  '■=  {A  £  TZ'" x n  :  rank(A)  <  k}  defines  the  optimization  problem: 

Vk{X)  =  argmin \\Y  -  Xll  (13) 

yecfc 

The  distinction  between  Vs  for  S  C.U  and  Vk  for  k  £  1Z+  is  apparent  from  context.  According  to  the  Eckart- Young  theorem 
[27],  the  best  rank-A:  approximation  of  a  matrix  X  corresponds  to  its  truncated  SVD.  In  more  detail,  if  X  =  UYVT  according 
to  (12),  then  Vk(X)  :=  UkYkVk  and  contains  the  k  strongest  principal  compoments  of  X  with  the  k  largest  in  magnitude 
singular  values  and  vectors — Sfc  £  ]Zkxk  is  a  diagonal  matrix  that  contains  the  first  k  diagonal  entries  of  S  and  Uk,  Vk 
contain  the  corresponding  left  and  right  singular  vectors,  respectively. 

Given  a  matrix  X  £  TZTnxn,  SVD  naturally  provides  a  set  of  orthonormal,  rank-1  matrices,  S  £  ortho(X),  as  outer  products 
of  the  left  singular  vectors  associated  with  the  non-zero  singular  values,  according  to  the  following  definition. 

Definition  2.  [Orthogonal  projections  using  SVD]  Let  X  £  ]Zrnxn  be  a  matrix  with  arbitrary  rank  and  SVD  decomposition 
given  by  (12).  Then,  S  :=  {ui.uf  :  i  =  1, . . . ,  rank(X)}  constitutes  a  set  of  orthonormal,  rank-1  matrices,  where  u,  denotes 
the  i-th  left  singular  vector,  such  that  X  £  span(S).  Moreover,  the  orthogonal  projection  onto  R(X)  is  given  by  Vs  =  UaU  J, 
as  defined  in  (12). 
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Remark  5.  For  k  <  rank(X),  the  rank-k  subspace  that  includes  most  of  the  energy  of  X  (in  Frobenius  norm )  is  spanned 
by  S  :=  {uiuf  :  i  =  1, ... ,  k}.  where  ut  is  the  singular  vector  associated  with  the  i-th  largest  singular  value  of  X.  The 
corresponding  low  rank  matrix  is  given  by  VsX. 

Remark  6.  Let  X  £  Xmxn  pe  a  matrix  with  arbitrary  rank  and  SVD  decomposition  given  by  (12).  Then,  for  any  S  C  U, 
the  following  holds  true:  ||,Ps.X’||F  <  ||X||  ,  that  is,  the  projection  of  a  matrix  does  not  increase  the  energy  in  terms  of  the 
Frobenius  norm. 


C.  Restricted  Isometry  Property 

Unfortunately,  low  rank  assumption  does  not  guarantee  successful  recovery  of  the  true  matrix  for  any  linear  operator  A.  In  the 
analogous  vector  case,  many  conditions  have  been  proposed  in  the  literature  to  establish  solution  uniqueness  and  reconstruction 
stability  such  as  null  space  property  [28],  exact  recovery  condition  [29],  etc.  For  the  matrix  case,  [12]  proved  the  so-called 
restricted  isometry  property  (RIP)  for  the  affine  rank  minimization  problem. 

Definition  3.  [Rank  Restricted  Isometry  Property  (R-RIP)  for  matrix  linear  operators  [12]]  A  linear  operator  A  :  jZrnXn  — >  1ZP 
satisfies  the  R-RIP  with  constant  4(A)  £  (0, 1)  if  and  only  if: 

(1  -  4(A))||Xj|F  <  ||  AX\\l  <  (1  +  4(A))||Xj|F,  VX  e  lZmxn  such  that  rank(X)  <  k.  (14) 

D.  Some  useful  bounds  using  R-RIP 

In  this  section,  we  present  some  lemmas  that  are  useful  in  our  subsequent  developments — these  lemmas  are  consequences  of 
the  R-RIP  of  A. 

Lemma  1.  [21]  Let  A  :  jZmxn  —t  TZP  be  a  linear  operator  that  satisfies  the  R-RIP  with  constant  5k(A).  Then,  Mv  £  TZP, 
the  following  holds  true: 

||Ps(A*«)||F  <  y/l  +  4(A)|H|2,  (15) 

where  S  CW  is  a  set  of  orthonormal,  rank-1  matrices  such  that  rank(VsX)  <  k,  VX  £  TZmxn. 

Lemma  2.  [21]  Let  A  :  jZrnxn  — >  1ZP  be  a  linear  operator  that  satisfies  the  R-RIP  with  constant  4(A).  Then,  VX  £  lZrnXn, 
the  following  holds  true: 

(l-5k(A))\\TsX\\F  <  ||P5A*APsX||f  <  (1  +  5k(A))\\VsX\\F,  (16) 

where  S  CW  is  a  set  of  orthonormal,  rank-1  matrices  such  that  rank(VsX)  <  k,  VX  £  lZmxn. 

Lemma  3.  [22]  Let  A  :  'R,rnXn  — >  1ZP  be  a  linear  operator  that  satisfies  the  R-RIP  with  constant  <5fc(A)  and  S  C-U  be  a  set 
of  orthonormal,  rank-1  matrices  such  that  rank(VsX)  <  k,  VX  £  lZmXn.  Then,  for  p.  >  0,  A  satisfies: 

\(pVsA*AVs)£[p(l-5k(A)),p(l  +  5k(A))\.  (17) 

where  A (0)  represents  the  range  of  eigenvalues  of  the  linear  operator  13  :  1ZP  — ¥  lZlnXn.  Moreover,  VX  £  lZmxn,  it  follows: 

||(1  -  pVsA*AVs)VsX ||F  <  max {p(l  +  Sk(A ))  -  1, 1  -  **(1  -  4(A))}  ||PSX||F.  (18) 

Lemma  4.  [22]  Let  A  :  ]ZrnXn  —¥  1ZP  be  a  linear  operator  that  satisfies  the  R-RIP  with  constant  4(A)  and  Si,  £2  [111  be 
two  sets  of  orthonormal,  rank-1  matrices  such  that  rank(Ps1us2X)  <  k,  VX  £  7ZmXn.  Then  the  following  inequality  holds: 

\\VSlA* AV^X\\f  <  Sk(A)\\rix\\F1  VX  e  span(S2).  (19) 

A  well-known  lemma  used  in  the  convergence  rate  proofs  of  this  class  of  greedy  hard  thresholding  algorithms  is  defined  next. 

Lemma  5.  [Optimality  condition  [30]]Let  0  C  firnXn  be  a  convex  subspace  and  f  :  0  — >  1Z  be  a  smooth  objective  function 
defined  over  0.  Let  X*  £  0  be  a  local  minimum  of  the  objective  function  f  over  the  set  0.  Then 

(V/(X*),X -X*)  >  0,  VX€0,  (20) 

for  all  convex  sets  0. 


III.  Algrebraic  Pursuits  in  a  nutshell 

In  this  section,  we  provide  an  overview  of  the  proposed  framework.  The  main  theorems  are  presented  in  section  V  where 
detailed  proofs  are  provided  in  the  appendix. 
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Input:  y.  A,  k,  Tolerance  //,  Maxlterations 
Initialize:  X(0)  <—  0,  Xo  <—  {0},  i  t—  0 

repeat 

Vi  <—  ortho  (Vk  (Vxi  V/(X (*) 
tSi  •<—  U  Ai 
Mi  t-  argmin^  ||y  -  «4.(X(i)  -  £ ^ Vf(X(i))) \\\ 

y(i)MX(i)-f?s,v/(x(i)) 

TV(i)  Vk{V(i))  with  Wi  <-  ortho(W(i)) 

&  <-  argmin5  ||y  -  A(W(i)  -  § Vwi'Vf(W(i 

X(i  +  1)  <-  W(i)  -  | VWiVf{W{i))  with  A- 
i  4 —  i  T  1 

until  |jX(i)  —  X(i  —  1) || 2  <  ?)||-X(i)||2  or  Maxlterations. 


(Best  rank-k  subspace  orthogonal  to  Xi) 
(Active  subspace  expansion) 

\\Vs.Vf(X(i))U2F  .  . 

(SteP  slze  selectton ) 

(Error  norm  reduction  via  gradient  descent ) 
(Best  rank-k  subspace  selection ) 

\\Vw.Vf(W[i))\\2F 


WAVw.vftwamJ 
ortho(X(i  +  1)) 


(Step  size  selection) 
(De-bias  using  gradient  descent) 


Algorithm  1:  Matrix  ALPS  1 


Input:  y ,  A,  k.  Tolerance  ;/,  Maxlterations 

Initialize:  X(0)  ■(—  0,  X0  ■(—  {0},  i  0 

repeat 

1: 

V,  <-  ortho(pfc(P4V/(X(*)))) 

(Best  rank-k  subspace  orthogonal  to  Xi) 

2: 

S%  4—  Vi  U  Xi 

(Active  subspace  expansion) 

3: 

V(i)  4-  argmhv:Vespan(5.)  \\y  -  AV\\\ 

(Error  norm  reduction  via  least-squares  optimization) 

4: 

X(i  +  1)  4—  Vk(V(i))  with  Xi+ 1  <—  ortho (X(i  +  1)) 
i  4 —  i  -\- 1 

until  H.X’(i)  —  X(i  —  1) || 2  <  77 1| -XI (z) || 2  or  Maxlterations. 

(Best  rank-k  subspace  selection) 

Algorithm  2:  ADMiRA  Instance 


A.  Precedence 

Explicit  descriptions  of  the  proposed  algorithms  are  provided  in  Algorithms  1  and  2,  in  pseudocode  form.  Algorithm  1 
follows  from  the  ALgrebraic  Pursuits  (ALPS)  scheme  for  the  vector  case  [31].  MATRIX  ALPS  I  provides  efficient  strategies  for 
adaptive  step  size  selection  and  additional  signal  estimate  updates  at  each  iteration  (these  motions  are  explained  in  detail  in  the 
next  subsection).  Algorithm  2  (ADMiRA)  [21]  further  improves  the  performance  of  Algorithm  1  by  introducing  least  squares 
optimization  steps  over  restricted  subspaces — this  technique  borrows  from  a  series  of  vector  reconstruction  algorithms  such  as 
CoSaMP  [32],  Subspace  Pursuit  (SP)  [33]  and  Hard  Thresholding  Pursuit  (HTP)  [34],  To  start-off,  we  first  derive  conditions 
under  which  MATRIX  ALPS  I  and  ADMiRA  algorithms  recover  the  underlying  low  rank  matrix,  in  terms  of  R-RIP  constant 
bounds.  The  resulting  conditions  are  competitive  with  the  state-of-the-art  approaches  [5],  [21]. 

In  a  nutshell,  both  algorithms  simply  seek  to  improve  the  subspace  selection  by  iteratively  collecting  an  extended  subspace 
Si  with  \S,,\  <  2k  and  then  finding  the  rank-fc  matrix  that  fits  the  measurements  in  this  restricted  subspace  using  least  squares 
techniques  or  gradient  descent  motions. 

Similarly  to  the  measurement-optimal  greedy  algorithms  for  the  sparse  vector  reconstruction  problem  [32],  [33],  our  method  is 
a  first-order  gradient  descent  algorithm;  hence,  it  requires  the  computation  of  the  gradient  V/(-),  a  superlinear  runtime  operation 
as  a  function  of  0(mn).  [24]  provides  an  efficient  randomized  scheme  with  sublinear  time  complexity  for  the  vector  analogue 
problem. 

B.  Algebraic  Pursuits  in  a  nutshell 

At  each  iteration,  the  Algorithms  1  and  2  perform  motions  from  the  following  list: 


1)  Best  rank-k  subspace  orthogonal  to  Xi  and  active  subspace  expansion:  We  identify  the  best  rank-fc  subspace  of  the 
current  gradient  V f(X(i)),  orthogonal  to  Xi  and  then  merge  this  low-rank  subspace  with  Xi.  This  motion  guarantees  that, 
at  each  iteration,  we  expand  the  current  rank- A:  subspace  estimate  with  k  new,  rank-1  orthogonal  subspaces  to  explore, 
avoiding  premature  termination  of  the  algorithm. 

2a)  Error  norm  reduction  via  greedy  descent  with  adaptive  step  size  selection  (Algorithm  1 ):  We  decrease  the  data  error 
by  performing  a  single  gradient  descent  step  over  the  objective  function.  This  scheme  is  based  on  a  one-shot  step  size 
selection  procedure  (Step  size  selection  step) — detailed  description  of  this  approach  is  given  in  Section  IV. 

2b)  Error  norm  reduction  via  least  squares  optimization  (Algorithm  2):  We  decrease  the  data  error  f(X)  as  much 
as  possible  on  the  active  0(fc)-low  rank  subspace.  Assuming  A  is  well-conditioned  over  low-rank  subspaces,  the  main 
complexity  of  this  operation  is  dominated  by  the  solution  of  a  symmetric  linear  system  of  equations. 
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3)  Best  rank-k  subspace  selection:  We  project  the  constrained  solution  onto  the  set  of  rank-fc  matrices  Ck  '■=  {A  £ 
7 ZmXn  :  rank(A)  <  k}  to  arbitrate  the  active  support  set.  This  step  is  calculated  in  polynomial  time  complexity  as  a 
function  oftnxn  using  SVD  or  other  matrix  rank-revealing  decomposition  algorithms — further  discussions  about  this  step 
and  its  approximations  can  be  found  in  Sections  VIII  and  IX. 

4)  De-bias  using  gradient  descent  ( Algorithm  1):  We  de-bias  the  current  estimate  W(i)  by  performing  an  additional 
gradient  descent  step,  decreasing  the  data  error.  The  step  size  selection  procedure  follows  the  same  motions  as  in  2a). 


IV.  Ingredients  for  hard  thresholding  methods 

A.  Step  size  selection 

To  emphasize  how  step  size  selection  pi  affects  the  convergence  rate  of  Algorithm  1,  we  provide  ideas  on  selecting  the  step 
size  pi  adaptively. 

For  the  vector  case  (1),  recent  works  on  the  performance  of  IHT  algorithm  provide  strong  convergence  rate  guarantees  in 
terms  of  R-RIP  constants;  c.f.  [35].  However,  as  a  prerequisite  to  achieve  these  strong  isometry  constant  bounds,  the  step  size  is 
set  pi  =  1 ,  Vi,  given  that  ||<J?||2  <  1  [34] — similar  analysis  can  be  found  in  [5]  for  the  matrix  case.  From  a  different  perspective, 
[36]  proposes  a  constant  step  size  pi  =  1/(1  +  821c),  Vi,  for  the  vector  case,  based  on  a  simple  convergence  analysis  of  the 
gradient  descent  method. 

Unfortunately,  most  of  the  above  problem  assumptions  are  not  naturally  met;  the  authors  in  [37]  provide  an  intuitive  example 
where  IHT  algorithm  behaves  differently  under  various  scalings  of  the  sensing  matrix  $ — similar  counterexamples  can  be 
devised  for  the  matrix  case.  Violation  of  these  configuration  details  usually  lead  to  unpredictable  signal  recovery  performance 
of  hard  thresholding  methods.  Therefore,  more  sophisticated  step  size  selection  procedures  should  be  devised  to  tackle  these 
computational  issues  during  actual  recovery.  On  the  other  hand,  the  computation  of  RIP  constants  has  exponential  time  complexity 
for  the  vector  case  strategy  of  [36]  and  exhaustive  combinatorial  search  is  necessary. 

Existing  approaches  broadly  fall  into  two  categories:  constant  and  adaptive  step  size  selection.  In  this  work,  we  present  efficient 
strategies  to  adaptively  select  the  step  size  pi  that  implies  fast  convergence  rate — without  violating  the  R-RIP  conditions  on  A. 
Constant  step  size  strategies  easily  follow  from  [23]  and  are  not  listed  in  this  work. 

Adaptive  step  size  selection.  There  is  limited  work  on  the  adaptive  step  size  selection  for  hard  thresholding  methods.  To 
the  best  of  our  knowledge,  apart  from  [23],  [37]-  [38]  are  the  only  studies  that  attempt  this  via  line  searching  for  the  vector 
case — no  references  are  available  for  the  matrix  case. 

According  to  Algorithm  1,  let  X(i)  £  'JZmxn  be  the  rank-fc  matrix  estimate  with  subspace  spanned  by  the  set  of  orthonormal, 
rank-1  matrices  in  Xi,  at  the  i-th  iteration.  Using  regular  gradient  descent  motions,  the  new  rank-fc  estimate  W(i)  can  be 
calculated  through: 

Vt  =  X(i)-^f(X(i)),  W(i)  =  Vk(V(i)),  (21) 

with  Wi  <—  ortho(W(i)).  It  then  holds  that  the  subspace  spanned  by  W)  originates:  i)  either  from  the  subspace  of  Xi,  ii)  or 
from  the  best  subspace  (in  terms  of  the  Frobenius  norm  metric)  of  the  current  gradient  V  f(X(i)),  orthogonal  to  Xi,  in)  or 
from  the  combination  of  orthonormal,  rank-1  matrices  lying  on  the  union  of  the  above  two  subspaces.  The  statements  above 
can  be  summarized  in  the  following  expression: 

span(Wi)  €  span  (Vi  U  Xi)  (22) 

for  any  step  size  pi  and  77,  :=  ortho  (Vk  (VxX  f(X(i)))j .  Since  rank(span(Wi))  <  k,  we  easily  deduce  the  following  key 
observation: 


Remark  7.  Let  Si  be  a  set  of  rank- 1  matrices  where  rank(span(Si ))  <  2k,  defined  as  follows: 

Si  =  Vi  U  Xi.  (23) 

Given  Wi  is  unknown  before  the  i-th  iteration,  Si  spans  the  smallest  subspace  that  contains  Wi  such  that  the  following  equality 


Vk  ( X(i)  -  £  V/(X(0)  )=Vk{  x(i)  -  ^Ps,Vf(X(i 


Pi 


necessarily  holds. 

To  compute  step-size  pi,  we  use: 

Pi  =  argmin  | y  -  A(X(i)  -  ^ VsJ(X(i 
LL  \  Z 


,2  \\Vs,Vf(X(i))fF 
12  ||^V/(X(f))|||’ 


(24) 


(25) 


i.e.,  \Xi  is  the  minimizer  of  the  objective  function,  given  the  current  gradient  V/(X(z)).  Note  that: 

1  —  ^2fc(*4.)  <  —  <  1  +  $2 


(26) 
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due  to  R-R1P — i.e.,  we  select  2k  subspaces  such  that  /i;  satisfies  the  R-RIP  condition.  We  can  derive  similar  arguments  for  the 
additional  step  size  selection  £,  in  Step  6  of  Algorithm  1. 

We  observe  that  adaptive  /j,  scheme  results  in  more  restrictive  “worst-case”  isometry  constants  compared  to  [5],  [34],  [39], 
but  faster  convergence  and  better  stability  are  empirically  observed  in  general.  Figures  l(a)-(b)  illustrate  some  characteristic 
examples.  The  performance  varies  for  different  problem  configurations.  For  p,  >  1,  SVP  diverges  for  various  test  cases.  We 
note  that,  for  large  fixed  matrix  dimensions  m,  n,  adaptive  step  size  selection  becomes  computationally  expensive  compared  to 
constant  step  size  selection  strategies,  as  the  rank  of  X*  increases. 


B.  Updates  over  Restricted  Subspaces 

In  Algorithm  1,  at  each  iteration,  the  new  estimate  W (i)  <—  Vk  ( V (*))  can  be  further  refined  by  applying  a  single  or  multiple 
gradient  descent  updates  with  line  search  restricted  on  W;  [34]  (Step  7  in  Algorithm  1): 


X(i  +  1)  <-  W(i) 


where  £*  = 


\Wwyf(wm\i 

\\AVWlVf(Wm\l 


In  spirit,  the  gradient  step  above  is  the  same  as  block  coordinate  descent  in  convex  optimization  except  that  we  find  the  subspaces 
adaptively  (almost  greedily).  Figure  1(c)  depicts  the  acceleration  achieved  by  using  additional  gradient  updates  over  restricted 
low-rank  subspaces. 


C.  Memory-based  acceleration 

Memory-based  techniques  can  be  used  to  improve  convergence  speed  and  stability.  We  keep  the  discussion  on  memory 
utilization  for  Section  VII  where  we  presnt  a  new  algorithmic  framework  for  low-rank  matrix  recovery. 

V.  Convergence  guarantees 

In  this  section,  we  present  the  theoretical  convergence  guarantees  of  Algorithms  1  and  2  as  functions  of  R-RIP  constants. 
To  characterize  the  performance  of  the  proposed  algorithms,  both  in  terms  of  convergence  rate  and  noise  resilience,  we  use  the 
following  recursive  expression: 

||X(i  +  1)  —  X*  ||f  <  p\\X(i)  —  X*  ||f  +  7||e||2-  (27) 

In  (27),  7  denotes  the  approximation  guarantee  and  provides  insights  into  algorithm’s  reconstruction  capabilities  when  additive 
noise  is  present;  p  <  1  expresses  the  convergence  rate  towards  a  region  around  X*,  whose  radius  is  determined  by  ||e || 2. 
In  short,  (27)  characterizes  how  the  distance  to  the  true  signal  X*  is  decreased  and  how  the  noise  level  affects  the  accuracy  of 
the  solution,  at  each  iteration. 


A.  Matrix  ALPS  I 

An  important  lemma  for  our  derivations  below  is  given  next: 


Lemma  6.  [Active  subspace  expansion]  Let  X (i)  £  7?7n><"  ^g  ^g  matrjx  estimate  at  the  i-th  iteration  and  let  Xi  be  a  set 
of  orthonormal,  rank-1  matrices  such  that  Xi  ortho(X(i)).  Then,  at  each  iteration,  the  Active  Subspace  Expansion  step  in 
Algorithms  1  and  2  identifies  information  in  X* ,  such  that: 

\\Vx*\SiX*\\F  <  (282k(A)  +  253fc(.4.))|| X (i)  —  +  \/2(l  +  52fc(«4.))||e||2,  (28) 

where  Si  =  Xi  U  XL  and  X*  <—  ortho(X*). 


Lemma  6  states  that,  at  each  iteration,  the  active  subspace  expansion  step  identifies  a  2k  rank  subspace  in  7\)mXn  such  that 
the  amount  of  unrecovered  energy  of  X* — i.e..  the  projection  of  X*  onto  the  orthogonal  subspace  of  span(<Si) — is  bounded 
by  (28). 

Then,  Theorem  1  characterizes  the  iteration  invariant  of  Algorithm  1  for  the  matrix  case: 

Theorem  1.  [Iteration  invariant  for  MATRIX  ALPS  1/  The  ( i  +  l)-th  matrix  estimate  X(i  +  1)  of  MATRIX  ALPS  I  satisfies 
the  following  recursion: 

j|X(t  +  1)  -  X*\\F  <  p\\X(i)  -  X'\\F  +  7||e||a,  (29) 

^ere  p  :=  +  (2 S2k(A)  +  263k(A))^&)  and 


7  :  = 


/  1  +  282k  (A)  \  /  2  y/l  +  82k  {A) 
'  1  —  $2k  (A)  /  V  1  —  82k  (A) 


+  l~St^A)V2{1  +  52k{A))) 


+ 


\/l  +  8k  (A) 
1  —  5k  (A) 


(30) 


Moreover,  when  53k{A)  <  0.1235,  the  iterations  are  contractive. 
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Fig.  1.  Median  error  per  iteration  for  various  step  size  policies  and  20  Monte-Carlo  repetitions.  In  brackets,  we  present  the  mean  time 
consumed  for  convergene  in  seconds,  (a)  m  =  n  =  2048,  p  =  0.4n2,  and  rank  k  =  70 — A  is  formed  by  permuted  and  subsampled 
noiselets  [40],  (b)  m.  =  2048  ,  n  =  512,  p  =  0.4n2,  and  rank  k  =  50 — we  use  underdetermined  linear  map  A  according  to  the  MC 
problem  (c)m  =  2048,  n  =  512,  p  =  0.4n2 .  and  rank  k  =  40 — we  use  underdetermined  linear  map  A  according  to  the  MC  problem. 


To  provide  some  intuition  behind  this  result,  assume  that  X*  is  a  rank-fc  matrix.  Then,  according  to  Theorem  1,  for  p  <  1, 
the  approximation  parameter  7  in  (29)  satisfies: 

7  <  5.7624,  for  53k(A)  <  0.1235.  (31) 

Moreover,  we  derive  the  following: 

P  <  (i  +  fa(A))2  +  8 523k(A))  <  i  =►  S3k(A)  <  0.079,  (32) 

which  is  a  stronger  R-RIP  condition  assumption  compared  to  state-of-the-art  approaches  [21].  In  the  next  section,  we  further 
improve  this  guarantee  using  Algorithm  2. 

Unfolding  the  recursive  formula  (28),  we  obtain  the  following  upper  bound  for  ||X(®)  —  ,X*||F  at  the  i-th  iteration: 

||X(*)-*‘||F<  p^lxto)  +  (33) 

Then,  given  X(0)  =  0,  MATRIX  ALPS  I  finds  a  rank-fc  solution  X  £  7 ZmXn  such  that  \\X  —  JA*||F  <  7+i~p  ||e||2  after 
*  iterations. 

If  we  ignore  the  steps  5  and  6  in  Algorithm  1,  we  obtain  another  projected  gradient  descent  variant  for  the  affine  rank 
minimization  problem,  for  which  we  obtain  the  following  performance  guarantees — the  proof  follows  from  the  proof  of  Theorem 
1. 


Corollary  1.  /MATRIX  ALPS  I  Instance]  In  Algorithm  I,  we  ignore  steps  5  and  6  and  let  X{i  +  1)  -A-  Vk(Vi)  with 
Xi+ 1  ■£-  ortho(X(i  +  1))  in  step  4.  Then,  using  same  analysis,  we  observe  that  the  following  recursion  is  satisfied: 

1 1 -XX®  +  1)  —  X  ||F  <  p||X(i)  —  X  ||F+7||e||2>  (34) 

for  p  :=  (^04  +  (2 52k(A)  +  253k(A))^^ 
p  <  1  when  53k(A)  <  0.1594. 

We  observe  that  the  additional  estimate  update  over  restricted  support  sets  results  in  more  restrictive  isometry  constants 
compared  to  Theorem  1.  In  practice,  additional  updates  result  in  faster  convergence  and  more  stable  signal  reconstruction,  as 
shown  in  Figure  1(c). 

B.  ADMiRA  Instance 

In  MATRIX  ALPS  I,  the  gradient  descent  steps  constitute  a  first-order  approximation  to  least-squares  minimization  problems. 
Replacing  Step  4  in  Algorithm  1  with  the  following  optimization  problem: 

V  (*)•<—  argmin  |y  —  AV  1 1 2  >  (35) 

V :  V  £span(«Si) 

we  obtain  ADMiRA  (furthermore,  we  remove  the  de-bias  step  in  Algorithm  1).  Assuming  that  the  linear  operator  A,  restricted 
on  sufficiently  low-rank  subspaces,  is  well-conditioned  in  terms  of  the  R-RIP  assumption,  the  optimization  problem  (35)  has  a 


j  and  7  ^ 


_ (  2\/ l+<52fc(»4) 

1  — ^2  fc(^) 


7^77^)  7/2(1  +  82k{A))y  Moreover, 
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unique  optimal  minimizes  By  exploiting  the  optimality  condition  in  Lemma  5,  ADMiRA  instance  in  Algorithm  2  features  the 
following  guarantee: 


Theorem  2.  [Iteration  invariant]  The  (i+l)-th  matrix  estimate  X(i+1)  of  ADMiRA  answers  the  following  recursive  expression: 


where 


and 


|x(i  +  i)-jr||F<pj|x(*)-x‘||F  +  7|| 

p  :=  (2S2k{  A)  +  283k{A))  i 


1  + 

1  -  5*k(A)  ’ 


7  : 


Moreover,  when  83k {A)  <  0.2267,  the  iterations  are  contractive. 


+  y/3)  \/l  +  82k (A). 


(36) 


(37) 


Similarly  to  MATRIX  ALPS  I  analysis,  the  parameter  7  in  Theorem  2  satisfies: 

7  <  5.1848,  for  83k{A)  <  0.2267.  (38) 

Furthermore,  to  compare  the  approximation  guarantees  of  Theorem  2  with  [21],  we  further  observe: 

83k(A)  <  0.1214,  for  p  <  1/2.  (39) 

We  remind  that  [21]  provides  convergence  guarantees  for  ADMiRA  with  84  k  (*4.)  <  0.04  for  p  =  1/2. 


VI.  Complexity  Analysis 

In  each  iteration,  computational  requirements  of  the  proposed  hard  thresholding  methods  mainly  depend  on  the  total  number 
of  linear  mapping  operations  A.  gradient  descent  steps,  least-squares  optimizations  and  matrix  decompositions  for  low  rank 
approximation.  Different  algorithmic  configurations  (e.g.  removing  steps  6  and  7  in  Algorithm  1)  lead  to  hard  thresholding  variants 
with  less  computational  complexity  per  iteration  and  better  R-RIP  conditions  for  convergence  but  a  degraded  performance  in 
terms  of  stability  and  convergence  speed  is  observed  in  practice.  On  the  other  hand,  these  additional  processing  steps  increase 
the  required  time-complexity  per  iteration;  hence,  low  iteration  counts  are  desired  to  trade-off  these  operations. 

A  non-exhaustive  list  of  linear  map  examples  includes  the  identity  operator  (Principal  component  analysis  (PCA)  problem), 
Fourier/Wavelets/Noiselets  tranformations  and  the  famous  Matrix  Completion  problem  where  A  is  a  mask  operator  such  that 
only  a  fraction  of  elements  in  X  is  observed.  Assuming  the  most  demanding  case  where  A  and  A *  are  dense  linear  maps  with 
no  structure,  the  computation  of  the  gradient  V  f(X(i))  at  each  iteration  requires  O(pfcmn)  arithmetic  operations. 

Given  a  set  S  of  orthonormal,  rank-1  matrices,  the  projection  VsX  for  any  matrix  X  £  77.mxrl  neetjs  0(m2n )  time 
complexity  as  a  matrix-matrix  multiplication  operation.  In  MATRIX  ALPS  I,  the  adaptive  step  size  selection  steps  require  the 
calculation  of  p:  and  £,  quantities  in  0(mm{pkmn,m2n})  time  complexity.  In  ADMiRA  solving  a  least-squares  system 
restricted  on  rank-2fc  and  rank-fc  subspaces  requires  0{pk2)  complexity — according  to  [32],  [21],  the  complexity  of  this  step 
can  be  further  reduced  using  approximation  techniques  such  as  the  Richardson  method  or  conjugate  gradients  algorithm. 

Contrariwise  to  the  vector  case  where  the  projection  onto  the  union  of  sparse  vector  subspaces  can  be  easily  computed 
via  sorting  the  signal  coefficients,  the  best  projection  of  an  arbitrary  matrix  onto  the  set  of  low  rank  matrices  requires  more 
sophisticated  linear  algebra  matrix  decompositions  such  as  SVD.  Using  the  Lanczos  approach,  we  require  Ofkmn)  arithmetic 
operations  to  compute  a  rank-A;  matrix  approximation  for  a  given  constant  accuracy — a  prohibitive  time-complexity  that  does 
not  scale  well  for  many  practical  applications.  Sections  VIII  and  IX  describe  approximate  low  rank  matrix  projections  and  how 
they  affect  the  convergence  guarantees  of  the  proposed  algorithms. 

Overall,  in  MATRIX  ALPS  I,  the  operation  that  dominates,  with  respect  to  the  total  number  of  operations  at  each  iteration, 
requires  0(min{pkmn,m2n})  time  complexity  while  ADMiRA  requires  Ofpkmn )  time  complexity  per  iteration. 


VII.  Memory  Utilization 


Iterative  algorithms  can  use  memory  to  gain  momentum  in  convergence.  The  success  of  the  memory-based  approaches  depends 
on  the  iteration  dependent  momentum  term  by  leveraging  previous  estimates.  Based  on  Nesterov’s  optimal  gradient  methods, 
we  propose  a  hard  thresholding  variant,  described  in  Algorithm  3 — the  vector  case  variant  was  previously  proposed  in  [41], 
Similarly  to  pi  strategies,  n  can  be  preset  as  constant  or  adaptively  computed  at  each  iteration.  Constant  momentum  step  size 
selection  has  no  additional  computational  cost  but  convergence  rate  acceleration  is  not  guaranteed  for  some  problem  formulations. 
On  the  other  hand,  empirical  evidence  has  shown  that  adaptive  n  selection  strategies  result  to  faster  convergence  compared  to 
zero-memory  methods  with  similar  complexity. 

For  the  case  of  strongly  convex  objective  functions,  Nesterov  [42]  proposed  the  following  constant  momentum  step  size 
selection  scheme:  n  =  ,  where  ao  £  (0, 1)  and  cti+i  is  computed  as  the  root  6  (0, 1)  of 


q|+i  =  (1  —  ai+i)aj  +  qai+ 1,  for  q  = 


2  (A) 


^n(A) 

O-max(A)  ’ 
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Input:  y.  A,  k.  Tolerance  Maxlterations 

Initialize:  X(0)  <—  0,  Xq  <—  {0},  Q( 0)  <—  0,  Qo  «—  {0}, 

Ti  Vi,  i  <—  0 

1 

repeat 

Vi  «-  ortho (Vk(TQXf(Q(i)))) 

(Best  rank-k  subspace  orthogonal  to  Qi) 

2 

Si  <—  Vi  U  Qi 

(Active  subspace  expansion) 

3 

Pi  £-  argminM  \\y  -  A(Q(i)  -  f VSi  V/(Q(i)))  | \l  = 

\\rSivf(Q(i))\\2F  .  . 

\Avs.vf(Q(i))Wi  (SteP  s,ze  Nation) 

4 

V(i)t-Q(i)-^P5iV/(Q(i)) 

(Error  norm  reduction  via  gradient  descent) 

5 

X{i+l)^Tk(V(i)) 

(Best  rank-k  subspace  selection) 

6 

Q(i  +  1)  t-  X(i  +  1)  +  Ti(X{i  +  1)  -  X(i)) 

(Momentum  update ) 

7 

Qi+i  <-  ortho (A”i  U  Xi+i) 

i  «—  i  +  1 

until  || JSC (i)  —  X(i  —  1) || a  <  »)||X(i)||2  or  Maxlterations. 

Algorithm  3:  MATRIX  ALPS  11 


where  k(A)  denotes  the  condition  number  of  A  and  am m(A),amla(A)  denote  the  minimum  and  maximum  singular  values  of 
A.  In  this  scheme,  exact  calculation  of  q  parameter  is  computationally  expensive  for  large-scale  data  problems  and  approximation 
schemes  are  leveraged  to  compensate  this  complexity  bottleneck. 

Based  upon  the  similar  ideas  as  adaptive  p-i  selection,  we  propose  to  select  n  as  the  minimize!'  of  the  objective  function: 


n  =  arg  min  \\y  —  AQ(i  +  1)  ||1 


(y  -  AX(i),AX(i)  -  AX (i-  1)) 
\\AX(i)  -  AX(i-  1)||| 


(40) 


where  AX(i),AX(i  —  1)  are  previously  computed.  According  to  (40),  n  is  dominated  by  the  calculation  of  a  vector  inner 
product,  a  computationally  cheaper  process  than  q  calculation.  Convergence  rate  performance  of  the  above  schemes  is  depicted 
in  Fig.  2(a)  for  the  vector  case  (1)  [23]. 

Theorem  3  characterizes  Algorithm  3  for  constant  momentum  step  size  selection.  To  keep  the  main  ideas  simple,  we  ignore 
the  additional  gradient  updates  in  Algorithm  3.  In  addition,  we  only  consider  the  noiseless  case  for  clarity.  The  convergence 
rate  proof  for  these  cases  is  left  to  the  reader. 


Theorem  3.  [Iteration  invariant  for  MATRIX  ALPS  II 7  Let  y  =  AX*  be  a  noiseless  set  of  observations.  To  recover  X*  from 
y  and  A  the  (i  +  1  )-th  matrix  estimate  X(i  +  1)  of  MATRIX  ALPS  II  satisfies  the  following  recursion: 

|| X(i  +  1)  -  X*\\F  <  a(l  +  Ti)j|X(i)  -  X*||F  +  aTi\\X(i  -  1)  -  X*\\F,  (41) 

where  a  :=  +  (2<53fc(«4.)  +  25  ik  (A))  •  Moreover,  solving  the  above  second-order  recurrence,  the  following 

inequality  holds  true: 


|X(i  +  l)-X*|L< 


t(l  +  Ti)  +  \/a2(l  +  n)2  +  4an 


i+ 1 


|X(0)-X* 


(42) 


Theorem  3  provides  convergence  rate  behaviour  proof  for  the  case  where  n  is  constant  VI.  The  more  elaborate  case  where 
Ti  follows  the  policy  described  in  (40)  is  left  as  an  open  question  for  future  work.  To  provide  some  insight  for  (42),  for 
Ti  =  1/4,  Vi  and  Ti  =  1/2,  Vi,  <S4fc  (A)  <  0.1187  and  S^kiA)  <  0.095  guarantee  convergence  in  Algorithm  3,  respectively. 
Moreover,  Figure  2(b)  shows  acceleration  in  practice  compared  to  the  zero-memory  case  (MATRIX  ALPS  I). 


VIII.  Accelerating  Matrix  ALPS:  ^Approximation  of  SVD  via  Column  Subset  Selection 
A  time-complexity  bottleneck  in  the  proposed  schemes  is  the  computation  of  the  singular  value  decomposition  to  find 
subspaces  that  describe  the,  yet,  unexplored  information  in  matrix  X* .  Unfortunately,  following  the  Eckart- Young  theorem, 
the  computational  cost  of  SVD  for  best  subspace  tracking  is  prohibitive  for  many  applications. 

Based  on  [43],  [44],  we  can  obtain  randomized  SVD  approximations  of  a  matrix  X  £  7?.mxn  using  column  subset  selection 
ideas:  given  X,  we  compute  leverage  scores  for  each  column  that  represent  their  “significance”.  In  particular,  we  define  a 
probability  distribution  that  weights  each  column  depending  on  the  amount  of  information  they  contain — usually,  the  distribution 
is  related  to  the  £2 -norm  of  the  columns.  The  main  idea  of  this  approach  is  to  compute  a  surrogate  rank-fc  matrix  'Pj.(X)  by 
subsampling  the  columns  according  to  this  distribution.  It  turns  out  that  the  total  number  of  sampled  columns  is  a  function  of 
the  parameter  e.  Moreover,  [45],  [46]  proved  that,  given  a  target  rank  k  and  an  approximation  parameter  e,  we  can  compute  an 
e-approximate  rank-fc  matrix  'Pl  (X)  according  to  the  following  defintion. 


Definition  4.  [e-approximate  low-rank  projection]  Let  X  £  -fcmXn  ye  an  arbitrary  matrix.  Then,  'Pj.(X)  projection  provides 
a  rank-k  matrix  approximation  to  X  such  that: 


'Pk(X)  -  X\\2p  <  (l  +  e)\\Vk{X)  -  X 


(43) 
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(a)  (b) 


Fig.  2.  (a)  Convergence  rate  example  using  memory  for  the  vector  case  in  (1 )  with  n  =  2000,  p  =  600  and  sparsity  level  s  =  120.  Blue 


and  black  lines  represent  Nesterov’s  Ti  selection  scheme  with  q  =  stndq  ~  ,  respectively;  green  line  represents  the  proposed 


momentum  step  size  selection,  (b).  Median  error  per  iteration  for  various  momentum  step  size  policies  and  SO  Monte-Carlo  repetitions. 
Here,  m  =  n  =  256,  p  =  0.4n2,  and  rank  k  =  22.  We  use  permuted  and  subsampled  noiselets  for  the  linear  map  A.  In  brackets,  we 
present  the  median  time  for  convergence  in  seconds. 


where  Vk(X)  =  argminV:ra„t(v)<fc  \\X  -  Y||F. 

Using  e-approximation  schemes  to  perform  the  Active  subspace  selection  Step,  the  following  upper  bound  holds.  The  proof 
is  provided  in  the  Appendix: 

Lemma  7.  [e-approximate  active  subspace  expansion]  Let  X(i)  £  7£m><n  £,e  the  matrix  estimate  at  the  i-th  iteration  and  let 
Xi  be  a  set  of  orthonormal,  rank-1  matrices  such  that  Xi  <—  ortho(X  (i)).  Furthermore,  let  T>\  :=  ortho(vk  (fPxi  Vf(X(i)))) 

be  a  set  of  orthonormal,  rank-1  matrices  that  span  rank-k  subspace  such  that  (43)  is  satisfied  for  X  :=  Vx  X  f(X(i))  .  Then, 
at  each  iteration,  the  Active  Subspace  Expansion  step  in  Algorithms  1  and  2  captures  information  contained  in  the  true  matrix 
X* ,  such  that: 

llTVvs.X* ||F  <  (2 S2k(A)  +  2 S3k{A)  +  v^(l  +  2 52k(A)  +  5k{A))\\X(i)  -  X‘||F 

+  \/  2(1  +  <52fe(»4.))||e||2  +  sfe^\Px’’\xi'A. '  e||F,  (44) 

where  Si  =  Xi  U  D]  and  X*  ■<—  ortho(X*). 


Furthermore,  to  prove  the  following  theorems,  we  extend  Lemma  (11)  as  follows.  The  proof  easily  follows  from  the  proof  of 
Lemma  (11),  using  Definition  (4): 

Lemma  8.  [e-approximation  rank-k  subspace  selection]  Let  V  ( i )  £  TV"’ x  be  a  rank-‘2k  proxy  matrix  in  the  subspace  spanned 
by  Si  and  let  W(i)  «—  Vk(V(i))  denote  the  rank-k  e-approximation  to  V(i),  according  to  (13).  Then: 

\\w(i)  -  V{i)\\l  <  (1  +  e)||W(0  -  V(i)||F  <  (1  +  e)|| VSi[V{i)  -  X*)\\F  <  (1  +  e)||V(i)  -  X*||F.  (45) 

where  W(i)  £-  Vk(V(i)). 


A.  MATRIX  ALPS  I  using  e-approximate  low-rank  projection  via  column  subset  selection 

Using  e-approximate  SVD  in  MATRIX  ALPS  I,  the  following  iteration  invariant  theorem  holds: 


Theorem  4.  [Iteration  invariant  with  e-approximate  projections  for  MATRIX  ALPS  \]  Assume  ||»4*e||  <  A  for  some 

constant  A  >  0.  The  ( i  +  1  )-th  matrix  estimate  X(i  +  1)  of  MATRIX  ALPS  I  with  e-approximate  projections  T>\  •<— 
ortho(vk(VxN7 /(-X’(i)))^  and  W(i)  £-  Vk(V(i))  in  Algorithm  1  satisfies  the  following  recursion: 

||X(*  +  1)  -  X*||F  <  p\\X({)  -  X*||F  +7||e||2  +/3A,  (46) 

where 


P  ■=  (1  + 


3  8k(A.) 

1  -  5k(A) 


)(2  +  , 


1  + 


53fc(-4.) 

1  —  82  k(A) 


^  (4^4 A)  +  s/e(l  +  3c>2fc(«4.))^ 


2&2k  (  A) 

1  —  82  k{A) 


(47) 
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(c) 


Fig.  3.  (a)-(b)Trade-off  curve  between  e-approximation  parameter  and  convergence  conditions  in  terms  of  4k  (A)  RIP  constant,  (c) 
Performance  comparison  using  e-approximation  SVD  [46]  in  MATRIX  ALPS  II.  m  =  n  =  256,  p  =  0.4 n2 ,  rank  of  X*  equals  2  and 
A.  constituted  by  permuted  noiselets.  The  non-smoothness  in  the  error  curves  is  due  to  the  extreme  low  rank-ness  of  X*  for  this  problem 
setting. 


and 


/?:= 


34(A) 

1  —  4(A) 


)  (2  +  e)  (1.  + 


4fc(A)  \ 
1  —  4  k  (A) ) 


Vi, 


7  := 


34(A)  \ 
1  -  4(A) ) 


(2  +  e)  [(x  + 


4fe(A) 

1  —  4fc(A) 


)  x/  2(1  +  4fe(A))  +  2 


\/l  +  4fe(A)  i 
1  —  4  k  (  A)  J 


(48) 


(49) 


Figure  3(a)  shows  the  trade-off  between  approximation  parameter  e  and  necessary  R-RIP  conditions  on  4fc(A)  such  that 
p  <  1.  We  observe  that  the  space  (4fc(A),  e)  £  (0, 1)  x  (0, 1)  is  partitioned  into  two  regions,  defining  a  phase  transition  curve 
on  the  boundary  of  the  two  parts.  We  note  that,  due  to  different  convergence  analysis  compared  to  Section  V,  MATRIX  ALPS 
I  in  Theorem  4  satisfies  p  <  1  for  e  =  0  if  and  only  ;/4fc(A)  <  0.0637 — this  complies  with  Figure  3(a)  on  the  vertical  axis 
for  e  =  0. 


B.  ADMiRA  using  e-approximate  low-rank  projection  via  column  subset  selection 
Similarly,  the  following  theorem  holds  for  ADMiRA  using  approximate  SVDs: 

Theorem  5.  [Iteration  invariant  with  e-approximate  projections  for  ADMiRA]  Assume  ||A*e||F  <  A  for  some  constant  A  >  0. 
The  {i+l)-th  matrix  estimate  X(i+ 1)  of  ADMiRA  in  Algorithm  2  with  e-approximate  projections  3—  ortho(vk(Vx.V  f(X(i 
and  X(i  +  1)  <—  V%{V (?’))  answers  the  following  recursive  expression: 

II X(i  +  1)  -  X*||F  <  p\\X(i)  -  X*\\F  +7|H|f  +/3A, 


where 


p  := 


1  +  (1  +  e  +  2vT  -(-  e)6% fc(A) 


1  ~  523k(A) 

/3\=sfe- 


(24k  (A)  +  24fc  (A)  +  s/e(l  +  24fc  (A)  +  4  (A)) , 


1  +  (1  +  e  +  2V1  +  e)6%  fc(A) 


1  ~S23k(A) 


7  ^/l  +  (1  +  e  +  2VTTl)5lk{A)  (l  +  ^  ^  ' 

We  need  the  following  lemma  to  prove  the  result  above — a  detailed  proof  can  be  found  in  the  Appendix. 


(50) 


(51) 


(52) 


Lemma  9.  Let  V(i)  be  the  least  squares  solution  in  Step  3  of  the  ADMiRA  algorithm  and  let  X(i  +  1)  be  a  proxy,  rank-k 
matrix  to  V(i)  according  to:  X(i  +  1)  «—  Vk(jV{i)).  Then,  ||X(i  +  1)  —  X*||  can  be  expressed  in  terms  of  the  distance 
from  V(i)  to  X*  as  follows: 


|X(*  +  1)-X* 


<  \Jl  +  ((1  +  e)  +  2\J\  +  e)<5|fe(A)  ^||  V(i)  —  X*||F  +  y 


((1  +  e)  +  2^/1  +  e)  (l  +  4fc(A)) 
1  +  ((1  +  e)  +  2%/l  +  t)5lk(A) 


(53) 
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Input:  y.  A,  k,  q.  Tolerance  q,  Maxlterations 

Initialize:  X(0)  4—  0,  Xo  4—  {0},  Q( 0)  4—  0,  Qo  <—  {0},  n  Vi,  i  4—  0 

repeat 

2 ),  f-  RANDOMIZEDPoWERlTERATION("Pg.  V/(Q(i)),  fc,  g)  (Rank-k  subspace  via  Randomized  Power  Iteration) 
Si  4—  XL  U  Qi  (Active  subspace  expansion ) 


XL  U  Q 

Pi  4—  argmin^  \\y-A(Q(x)-^Vsyf(Q(i 

v(i)^Q(i)-fvsyf(Q(i)) 

U  4—  RandomizedPowerIteration  (V (i) ,  k ,  5) 
X(i  +  l)4-PwV(i) 

Q(i  +  1)  4-  X(i  +  1)  +  Ti(X(i  +  1)  -  X(i)) 

Qi+ 1  4-  ortho(T'i  U  Af+i) 
i  4 —  i  +  1 

until  H-X’(i)  —  X(i  —  1) || 2  <  ?)||-X(*)||2  or  Maxlterations. 


(Step  size  selection) 
(Error  norm  reduction  via  gradient  descent) 
(Rank-k  subspace  via  Randomized  Power  Iteration ) 
(Best  rank-k  subspace  selection ) 
(Momentum  update ) 


Algorithm  4:  Randomized  MATRIX  ALPS  II  with  QR  Factorization 


Then,  the  proof  of  Theorem  5  easily  follows  by  combining  Lemma  7 — 10.  Following  the  same  process  as  in  MATRIX  ALPS 
I  algorithm,  we  easily  derive  the  trade-off  curve  as  depicted  in  Figure  3(b).  The  same  remarks  apply  here  as  in  the  MATRIX 
ALPS  I  case. 

To  illustrate  the  impact  of  SVD  e-approximation  on  the  signal  reconstruction  performance  of  the  proposed  algorithms,  we 
replace  the  best  rank-fc  projections  in  steps  1  and  5  of  Algorithm  1  by  the  e-approximation  SVD  algorithm,  presented  in  [46], 
In  this  paper,  the  column  subset  selection  algorithm  satisfies  the  following  theorem: 

Theorem  6.  Let  X  £  pirrlXn  ye  a  signaI  of  interest  with  arbitrary  rank  <  min{m,  n}  and  let  Xk  represent  the  best  rank-k 
approximation  of  X.  After  2 (k  +  l)(log (k  +  1)  +  1)  passes  over  the  data,  the  Linear  Time  Low-Rank  Matrix  Approximation 
algorithm  in  [46]  computes  a  rank-k  approximation  Vl(X)  £  /JZrnxn  such  that  Definition  4  is  satisfied  with  probability  at  least 
3/4. 

The  proof  of  Theorem  6  is  provided  in  [46],  In  total,  Linear  Time  Low-Rank  Matrix  Approximation  algorithm  [46]  requires 
0(mn(k/e  +  k2  log  k )  +  ( m  +  n)(k2 /e2  +  fc3  log  fc/e  +  k4  log2  k))  time-complexity  and  0(min{m,  n}(fc/e  +  k2  log  fc))  space 
complexity.  However,  while  column  subset  selection  methods  such  as  [46]  reduce  the  overall  complexity  of  low-rank  projections 
in  theory,  in  practice  this  applies  only  in  very  high-dimensional  settings.  To  strengthen  this  argument,  in  Figure  3(c)  we  compare 
SVD-based  MATRIX  ALPS  II  with  MATRIX  ALPS  II  using  the  e-approximate  column  subset  selection  method  in  [46],  We 
observe  that  the  total  number  of  iterations  for  convergence  increases  due  to  e-approximate  low-rank  projections,  as  expected. 
Nevertheless,  we  observe  that,  on  average,  the  column  subset  selection  process  [46]  is  computationally  prohibitive  compared  to 
regular  SVD  calculation  due  to  the  time  overhead  in  the  column  selection  procedure — fewer  passes  over  the  data  are  desirable 
in  practice  to  trade-off  the  increased  number  of  iterations  for  convergence.  In  the  next  Section,  we  present  alternatives  based  on 
recent  trends  in  randomized  matrix  decompositions  and  how  we  can  use  them  in  low-rank  recovery. 

IX.  Accelerating  Matrix  ALPS:  SVD  Approximation  using  Randomized  Matrix  Decompositions 

Finding  low-cost  SVD  approximations  to  tackle  the  above  complexity  issues  is  a  challenging  task.  Recent  works  on  probabilistic 
methods  for  matrix  approximation  [25]  dictate  a  family  of  efficient  approximate  projections  on  the  set  of  rank-deficient  matrices 
with  clear  computational  advantages  over  regular  SVD  computation  in  practice  and  attractive  theoretical  guarantees.  In  this 
work,  we  elaborate  over  the  low-cost,  power-iteration  subspace  tracking  scheme,  described  in  Algorithms  4.3  and  4.4  in  [25]. 
Our  proposed  algorithm  is  described  in  Algorithm  4. 

The  convergence  guarantees  of  Algorithm  4  follow  the  same  motions  described  in  Section  VIII,  where  e  is  a  function  of 
m,  n,  k  and  q.  An  extensive  theoretical  study  on  randomized  low-rank  approximations  and  their  impact  in  the  ARM  problem 
is  left  for  future  work. 


X.  Experiments 

A.  List  of  algorithms 

In  the  following  experiments,  we  compare  algorithms  drawn  from  the  following  list:  (i)  the  Singular  Value  Projection  (SVP) 
algorithm  [5],  a  non-convex  first-order  projected  gradient  descent  algorithm  with  constant  step  size  selection  (here,  we  study  the 
case  where  p  =  1),  (ii)  the  inexact  ALM  algorithm  [18]  based  on  augmented  Langrance  multipliers,  (in)  the  OptSpace  algorithm 
[47],  a  gradient  descent  algorithm  on  the  Grassmann  manifold,  ( iv )  the  Grassmannian  Rank-One  Update  Subspace  Estimation 
(GROUSE)  and  the  Grassmannian  Robust  Adaptive  Subspace  Tracking  (GRASTA)  algorithsm  [48],  [49],  two  stochastic  gradient 
descent  algorithms  that  operate  on  the  Grassmannian — moreover,  to  allay  the  impact  of  outliers  in  the  subspace  selection  step, 
GRASTA  incorporates  the  augmented  Lagrangian  of  Xi-norm  loss  function  into  the  Grassmannian  optimization  framework,  (v) 
the  Riemannian  Trust  Region  Matrix  Completion  (RTRMC)  algorithm  [50],  a  matrix  completion  method  using  first-  and  second- 
order  Riemannian  trust-region  approaches,  (vi)  the  Low  rank  Matrix  Fitting  algorithm  (LMatFit)  [51],  a  nonlinear  successive 
over-relaxation  algorithm  and  ( vii )  the  algorithms  MATRIX  ALPS  I,  ADMiRA  [21],  MATRIX  ALPS  II  and  Randomized  MATRIX 
ALPS  II  with  QR  Factorization  (referred  shortly  as  MATRIX  ALPS  II  with  QR)  presented  in  this  paper. 
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B.  Implementation  details 

To  properly  compare  the  algorithms  in  the  above  list,  we  preset  a  set  of  parameters  that  are  common.  We  denote  the  ratio 
between  the  number  of  observed  samples  and  the  number  of  variables  in  X*  as  SR  :=  p/(m  ■  n)  (sampling  ratio).  Furthemore, 
we  reserve  FR  to  represent  the  degree  of  freedom  in  a  rank-fc  matrix  to  the  number  of  observations — this  corresponds  to  the 
following  definition  FR  :=  ( k(m  +  n  —  k))/p.  In  most  of  the  experiments,  we  fix  the  number  of  observable  data  p  =  0.3mn 
and  vary  the  dimensions  and  the  rank  k  of  the  matrix  X*.  This  way,  we  create  a  wide  range  of  different  problem  configurations 
with  variable  FR. 

In  all  algorithms,  we  fix  the  maximum  number  of  iterations  to  700,  unless  otherwise  stated.  To  solve  a  least  squares  problem 
over  a  restricted  low-rank  subspace,  we  use  conjugate  gradients  with  maximum  number  of  iterations  given  by  cg_maxiter  :=  500 
and  tolerance  parameter  cg_tol  :=  10-10.  We  use  the  same  stopping  criteria  for  the  majority  of  algorithms  under  consideration: 


X(i)-X(i-1)||F 

\W)\\f 


<  tol, 


(54) 


where  X(i),  X(i  —  1)  denote  the  current  and  the  previous  estimate  of  X*  and  tol  :=  5  •  10-5.  If  this  is  not  the  case,  we 
tweak  the  algorithms  to  minimize  the  total  execution  time  and  achieve  similar  reconstruction  performance  as  the  rest  of  the 
algorithms.  For  SVD  calculations,  we  use  the  lansvd  implementation  in  PROPACK  package  [52] — moreover,  all  the  algorithms 
in  comparison  use  the  same  linear  operators  A  and  A*  for  gradient  and  SVD  calculations  and  conjugate-gradient  least-squares 
minimizations.  For  fairness,  we  modified  all  the  algorithms  so  that  they  exploit  the  true  rank. 


C.  Limitations  of  •  |  phased  algorithms:  a  toy  example 

While  nucluear  norm  heuristic  is  widely  used  in  solving  the  low-rank  minimization  problem  with  impressive  reconstruction 
performance  in  polynomial  time  cost,  [53]  presents  simple  problem  cases  where  convex,  nuclear  norm-based,  algorithms  fail  in 
practice.  Using  the  ||  ■  ||^-norm  in  the  objective  function  as  the  convex  surrogate  of  the  rank(-)  metric  might  lead  to  a  candidate 
set  with  multiple  solutions,  introducing  ambiguity  in  the  selection  process.  Borrowing  the  example  in  [53],  we  test  the  list  of 
algorithms  above  on  a  toy  problem  setting.  To  this  end,  we  design  the  following  problem:  let  X*  £  X5x4  be  the  matrix  of 
interest  with  rank(X*)  =  2,  as  shown  in  Figure  4(a).  We  consider  the  case  where  we  have  access  to  X*  only  through  a  subset 
of  its  entries,  as  shown  in  Figure  4(b). 
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Fig.  4.  Matrix  Completion  toy  example  for  X*  £  725x4.  We  reserve  ’?’  to  denote  the  unobserved  entried. 


In  Figure  5,  we  present  the  reconstruction  performance  of  various  matrix  completion  solvers  after  300  iterations.  Although 
there  are  multiple  solutions  that  induce  the  recovered  matrix  and  have  the  same  rank  as  X*.  most  of  the  algorithms  in  comparison 
reconstruct  X*  successfully.  We  note  that,  in  some  cases,  the  inadequancy  of  an  algorithm  to  reconstruct  X*  is  not  because  of 
the  (relaxed)  problem  formulation  but  due  to  its  fast — but  inaccurate — implementation  (fast  convergence  versus  reconstruction 
accuracy  trade-off). 


D.  Synthetic  data 

General  affine  rank  minimization  using  noiselets:  In  this  experiment,  the  set  of  observations  y  £  TV’  satisfy: 

y  =  AX*  +  e  (55) 

Here,  we  use  permuted  and  subsampled  noiselets  for  the  linear  operator  A  [10].  The  signal  X*  is  generated  as  the  multiplication 
of  two  low-rank  matrices,  L  £  TVnxk  and  R  £  TZnXk,  such  that  X*  =  LRr  and  ||X* ||  =  1.  Both  L  and  R  have  random 

independent  and  identically  distributed  (iid)  Gaussian  entries  with  zero  mean  and  unit  variance.  In  the  noisy  case,  the  additive 
noise  term  e  £  1ZP  contains  entries  drawn  from  a  zero  mean  Gaussian  distribution  with  ||e||  £  {1CP3, 1CP4}. 

We  compare  the  following  algorithms:  SVP,  ADMiRA,  MATRIX  ALPS  I,  MATRIX  ALPS  II  and  MATRIX  ALPS  II  with  QR 
for  various  problem  configurations,  as  depicted  in  Table  I  (there  is  no  available  code  with  arbitrary  sensing  operators  for  the 
rest  algorithms).  In  Table  I,  we  show  the  median  values  of  reconstruction  error,  number  of  iterations  and  execution  time  over 
50  Monte  Carlo  iterations.  For  all  cases,  we  assume  SR  =  0.3  and  we  set  the  maximum  number  of  iterations  to  700.  Bold  font 
denotes  the  fastest  execution  time.  Furthermore.  Figure  6  illustrates  the  effectiveness  of  the  algorithms  for  some  representative 
problem  configurations. 

In  Table  6,  MATRIX  ALPS  II  and  MATRIX  ALPS  II  with  QR  obtain  accurate  low-rank  solutions  much  faster  than  the  rest  of 
the  algorithms  in  comparison.  In  high  dimensional  settings,  MATRIX  ALPS  II  with  QR  scales  better  as  the  problem  dimensions 
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Fig.  5.  Toy  example  reconstruction  performance  for  various  algorithms.  We  observe  that  X *  is  an  integer  matrix — since  the  algorithms 
under  consideration  return  real  matrices  as  solutions,  we  round  the  solution  elementwise. 

TABLE  I 

General  ARM  using  Noiselets. 


Configuration 

FR 

SVP 

ADMiRA 

Matrix  ALPS  I 

m 

n 

k 

~MT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

256 

512 

5 

0 

0.097 

38 

-7 

1 

o 

CM 

CM 

0.78 

27 

4.4  •  10“a 

2.26 

13.5 

1  ■  10_a 

0.7 

256 

512 

5 

10“a 

0.097 

38 

6 ■ 10-4 

0.91 

700 

2 • 10“a 

65.94 

16 

7-  10“4 

0.92 

256 

512 

5 

10-4 

0.097 

38 

2.1  ■  10“4 

0.94 

700 

4.1  •  10“4 

69.03 

11.5 

7.9  •  10“a 

0.72 

256 

512 

10 

0 

0.193 

50 

CO 

O 

1 

1.44 

38 

5 ■ lO'5 

4.42 

13 

3.9  ■  lO-5 

0.92 

256 

512 

10 

10“a 

0.193 

50 

9 ■ 10-4 

1.39 

700 

1.7-  10“a 

56.94 

29 

1.2  •  10~a 

1.78 

256 

512 

10 

10-4 

0.193 

50 

3.5  ■  10~4 

1.38 

700 

9.3  •  10"5 

64.69 

14 

1.4  •  10-4 

0.93 

256 

512 

20 

0 

0.38 

86 

7- 10"4 

3.32 

700 

4.1  •  10“a 

81.93 

45 

2  •  10“4 

4.09 

256 

512 

20 

10“a 

0.38 

86 

1.5  ■  10"a 

3.45 

700 

4.2  •  10~a 

77.35 

69 

2.3  •  10“a 

5.05 

256 

512 

20 

10-4 

0.38 

86 

7- 10"4 

3.26 

700 

4 • 10~a 

79.47 

46 

4-  10-4 

4.1 

512 

1024 

30 

0 

0.287 

66 

CD 

O 

1 

8.79 

295 

5.4  •  10“5 

143.53 

24 

1  ■  10-4 

8.01 

512 

1024 

40 

0 

0.38 

86 

7- lO"4 

10.09 

700 

4.3  •  10~a 

251.27 

45 

2  •  10-4 

11.08 

1024 

2048 

50 

0 

0.24 

57 

1 

O 

CO 

42.88 

103 

5.2  ■  10“5 

312.62 

18 

5.7-  10~a 

35.86 

Matrix  ALPS  II 

Matrix  ALPS  II  with  QR 

m 

n 

k 

iter. 

err. 

time 

iter. 

err. 

time 

256 

512 

5 

0 

0.097 

8 

7.1  ■  10_e 

0.42 

10 

9.1  •  10-b 

0.39 

256 

512 

5 

10“a 

0.097 

9 

7- 10"4 

0.56 

20 

7-  10 

0.93 

256 

512 

5 

10-4 

0.097 

8 

7- 10-5 

0.5 

10 

lO 

1 

o 

00 

0.46 

256 

512 

10 

0 

0.193 

10 

2.3  ■  lO-5 

0.68 

13 

2.4-  10_a 

0.64 

256 

512 

10 

10_a 

0.193 

19 

1 • 10_a 

1.29 

27 

1  •  10 

-3 

1.35 

256 

512 

10 

10-4 

0.193 

10 

1.1  •  10“4 

0.68 

13 

1.1  •  10-4 

0.62 

256 

512 

20 

0 

0.38 

21 

1 ■ lO"4 

1.92 

24 

1  •  10 

1.26 

256 

512 

20 

10“a 

0.38 

36 

1.5  •  10~a 

2.67 

39 

1.5  •  10“a 

1.69 

256 

512 

20 

10-4 

0.38 

21 

2 ■ 10“4 

1.87 

24 

2  ■  10 

1.22 

512 

1024 

30 

0 

0.287 

14 

4.5  ■  lO-5 

4.7 

18 

3.3  •  10-3 

4.15 

512 

1024 

40 

0 

0.38 

21 

1 ■ 10-4 

6.01 

24 

1  •  10 

4.53 

1024 

2048 

50 

0 

0.24 

12 

2.5  ■  10“a 

22.76 

15 

3.3  •  10_a 

17.94 

increase,  leading  to  faster  convergence.  Moreover,  its  execution  time  is  at  least  a  few  orders  of  magnitude  smaller  compared  to 
SVP,  ADMiRA  and  MATRIX  ALPS  I  implementations. 

Robust  matrix  completion:  We  design  matrix  completion  problems  in  the  following  way.  The  signal  of  interest  X*  £  7 ZmXn 
is  synthesized  as  a  rank-fc  matrix,  factorized  as  X*  :=  LRr  with  ||X*||  =  1  where  L  £  an[j  R  £  j^nxk  as  defined 

above.  In  sequence,  we  subsample  X*  by  observing  p  =  0.3mn  entries,  drawn  uniformly  at  random.  We  denote  the  set  of  ordered 
pairs  that  represent  the  coordinates  of  the  observable  entries  as  SI  =  {(hj)  '■  [ A” * ] l:j  is  known}  C  {1, . . . ,  in}  x 
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(d)  (e)  (f) 


Fig.  6.  Low  rank  signal  reconstruction  using  noiselet  linear  operator.  The  error  curves  are  the  median  values  across  50  Monte-Carlo 
realizations  over  each  iteration.  For  all  cases,  we  assume  p  =  0.3 mn.  (a)  m  =  256,  n  =  512,  k  =  10  and  ||e||0  =  10-3.  (b)  m  =  256, 
n  =  512,  k  =  10  and  ||e||2  =  1CT4.  (c)  m  =  256,  n  =  512,  k  =  20  and  ||e||  =  0.  (d)  m  =  512,  n  =  1024,  k  =  30  and  ||e||  =  0.  (e) 
m  =  512,  n  =  1024.  k  =  40  and  ||e||a  =  0.  (f)  in  =  1024,  n  =  2048,  k  =  50  and  ||ef[  =  0. 


and  let  An  denote  the  linear  operator  (mask)  that  samples  a  matrix  according  to  Cl.  Then,  the  set  of  observations  satisfies: 

y  =  AnX*  +  e,  (56) 

i.e.,  the  known  entries  of  X*  are  structured  as  a  vector  y  £  1ZP,  disturbed  by  a  dense  noise  vector  e  £  1ZP  with  fixed-energy, 
which  is  populated  by  iid  zero-mean  Gaussians. 

To  demonstrate  the  reconstruction  accuracy  and  the  convergence  speeds,  we  generate  various  problem  configurations  (both 
noisy  and  noiseless  settings),  according  to  (56).  The  energy  of  the  additive  noise  takes  values  ||e||0  £  { 10— 3 ,  10~4}.  All  the 
algorithms  are  tested  for  the  same  signal-matrix-noise  realizations.  A  summary  of  the  results  can  be  found  in  Tables  II,  III  and, 
IV  where  we  present  the  median  values  of  reconstruction  error,  number  of  iterations  and  execution  time  over  50  Monte  Carlo 
iterations.  For  all  cases,  we  assume  SR  =  0.3  and  set  the  maximum  number  of  iterations  to  700.  Bold  font  denotes  the  fastest 
execution  time.  Some  convergence  error  curves  for  specific  cases  are  illustrated  in  Figures  7  and  8. 

In  Table  7,  LMaFit  [51]  implementation  presents  the  fastest  convergence  for  small  scale  problem  configuration  where  m  =  300 
and  n  =  600.  We  note  that  part  of  LMaFit  implementation  uses  C  code  for  acceleration.  GROUSE  [48]  is  a  competitive  low-rank 
recovery  method  with  attractive  execution  times  for  the  extreme  low  rank  problem  settings  due  to  stochastic  gradient  descent 
techniques.  Nevertheless,  its  execution  time  performance  degrades  significantly  as  we  increase  the  rank  of  X* .  Moreover,  we 
observe  how  randomized  low  rank  projections  accelerate  the  convergence  speed  where  MATRIX  ALPS  II  with  QR  converges 
faster  than  MATRIX  ALPS  II.  In  Tables  III  and  IV,  we  increase  the  problem  dimensions.  Here,  MATRIX  ALPS  II  with  QR  has 
faster  convergence  for  most  of  the  cases  and  scales  well  as  the  problem  size  increases.  We  note  that  we  do  not  exploit  stochastic 
gradient  descent  techniques  in  the  recovery  process  to  accelerate  convergence  which  is  left  for  future  work. 


E.  Real  data 

We  use  real  data  images  to  highlight  the  reconstruction  performance  of  the  proposed  schemes.  To  this  end,  we  perform 
grayscale  image  denoising  from  an  incomplete  set  of  observed  pixels — similar  experiments  can  be  found  in  [51],  Based  on  the 
matrix  completion  setting,  we  observe  a  limited  number  of  pixels  from  the  original  image  and  perform  a  low  rank  approximation 
based  only  on  the  set  of  measurements.  While  the  true  underlying  image  might  not  be  low-rank,  we  apply  our  solvers  to  obtain 
low-rank  approximations. 
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TABLE  II 

Matrix  Completion  problem  for  m  =  300  and  n  =  600.  ”  depicts  no  information  or  not  applicable  due  to  time 

OVERHEAD. 


Configuration 

FR 

SVP 

GROUSE 

TFOCS 

m 

n 

k 

IMT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

300 

600 

5 

0 

0.083 

43 

to 

CO 

O 

1 

4- 

0.59 

- 

1.52  •  10“4 

0.08 

- 

8.69  ■  10“a 

3.36 

300 

600 

5 

10_a 

0.083 

42 

6-  10~4 

0.65 

- 

2 • 10-4 

0.082 

- 

5-  10~4 

3.85 

300 

600 

5 

10-4 

0.083 

43 

3-  10~4 

0.64 

- 

2 • 10“4 

0.079 

- 

1  •  10-4 

3.5 

300 

600 

10 

0 

0.165 

54 

4-  10~4 

0.9 

- 

4.5  ■  10_b 

0.22 

- 

2  ■  10"4 

6.43 

300 

600 

10 

10_a 

0.165 

54 

9-  10~4 

0.89 

- 

2 ■ 10“4 

0.16 

- 

8-  10“4 

7.83 

300 

600 

10 

10“4 

0.165 

54 

4-  10~4 

0.91 

- 

2 • 10-4 

0.16 

- 

1  •  10~4 

6.75 

300 

600 

20 

0 

0.326 

85 

8-  10~4 

2.04 

- 

1 ■ 10-4 

0.81 

- 

2  ■  10"4 

30.04 

300 

600 

40 

0 

0.637 

241 

3.4  ■  10_a 

11.1 

- 

3.1  •  10~a 

13.94 

- 

- 

- 

Inexact  ALM 

OptS  pace 

GRASTA 

m 

n 

k 

1RT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

300 

600 

5 

0 

0.083 

24 

6.7-  lO"5 

0.47 

31 

2.8  •  10~b 

2.41 

- 

2.2  ■  10“4 

2.07 

300 

600 

5 

10“a 

0.083 

24 

6-  10~4 

0.49 

297 

5 ■ lO"4 

22.82 

- 

1  •  10~4 

2.07 

300 

600 

5 

10-4 

0.083 

24 

1  •  10~4 

0.49 

267 

1 ■ 10-4 

21.56 

- 

8- 10“a 

2.1 

300 

600 

10 

0 

0.165 

26 

1  •  10~4 

0.6 

37 

SO 

1 

o 

CO 

OQ 

8.42 

- 

8.6  ■  10-B 

4.5 

300 

600 

10 

10_a 

0.165 

26 

8-  10~4 

0.59 

304 

8 • 10-4 

66.02 

- 

5.5  ■  10~a 

3.43 

300 

600 

10 

10-4 

0.165 

26 

1  •  10~4 

0.61 

304 

1 ■ 10-4 

65.56 

- 

5.3  ■  10"a 

3.44 

300 

600 

20 

0 

0.326 

44 

3-  10~4 

1.37 

- 

- 

- 

- 

5-  lO"4 

10.51 

300 

600 

40 

0 

0.637 

134 

1.6  •  10"a 

7.08 

- 

- 

- 

- 

5.2  ■  10“a 

251.34 

RTRMC 

LMaFit 

Matrix  ALPS  I 

m 

n 

k 

1RT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

300 

600 

5 

0 

0.083 

13 

1.2  ■  10-4 

0.59 

20 

2.2  ■  10-4 

0.054 

22 

1.8  ■  10_a 

0.76 

300 

600 

5 

10“a 

0.083 

13 

1 • 10-4 

0.59 

19 

5 ■ lO"4 

0.049 

37 

7-  10~4 

1.34 

300 

600 

5 

10“4 

0.083 

13 

2  •  10~4 

0.59 

21 

1 ■ lO"4 

0.052 

18 

1  •  10~4 

0.61 

300 

600 

10 

0 

0.165 

16 

1.1  •  10“a 

1.03 

23 

1 ■ 10-4 

0.064 

16 

1  •  10-4 

0.65 

300 

600 

10 

10_a 

0.165 

17 

1  •  10~4 

1.09 

26 

8 • lO"4 

0.077 

30 

1.1  •  10_a 

1.16 

300 

600 

10 

10-4 

0.165 

17 

2  •  10'4 

1.09 

32 

1 • 10-4 

0.097 

16 

1  •  10“4 

0.63 

300 

600 

20 

0 

0.326 

22 

4-  10~4 

2.99 

37 

2 • 10-4 

0.12 

37 

2  ■  10~4 

2.05 

300 

600 

40 

0 

0.637 

35 

3- 10~a 

11.83 

233 

4.9  ■  10“4 

2.52 

500 

6.5  ■  U)-'J 

45.67 

ADMiRA 

Matrix  ALPS  II 

Matrix  ALPS  II  with  QR 

m 

n 

k 

1MT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

300 

600 

5 

0 

0.083 

59 

5.2  ■  10-5 

2.86 

10 

1.7-  lO”5 

0.34 

14 

3.2  ■  10-5 

0.45 

300 

600 

5 

10_a 

0.083 

700 

4- 10~a 

30.96 

12 

6 ■ 10"4 

0.44 

24 

6-  10~4 

0.81 

300 

600 

5 

10-4 

0.083 

700 

4.5  ■  10_a 

31.45 

10 

1 ■ lO"4 

0.36 

14 

1  •  io-4 

0.47 

300 

600 

10 

0 

0.165 

47 

1  •  10'a 

2.56 

12 

3  •  lO"13 

0.48 

16 

3.4  •  10_a 

0.49 

300 

600 

10 

10_a 

0.165 

700 

1.5  ■  10"a 

28.49 

19 

9 ■ 10"4 

0.74 

29 

9-  10"4 

0.95 

300 

600 

10 

10-4 

0.165 

700 

1  •  10~4 

31.99 

12 

1 • 10-4 

0.49 

16 

1  •  10-4 

0.54 

300 

600 

20 

0 

0.326 

700 

1.2  ■  10“a 

41.86 

20 

1 ■ 10-4 

1.16 

23 

1  •  10“4 

0.79 

300 

600 

20 

0 

0.326 

- 

- 

- 

72 

2 • 10“4 

7.21 

68 

2  ■  10“4 

2.6 

Figures  9  and  10  depict  the  reconstruction  results.  In  the  first  test  case,  we  use  a  512  x  512  grayscale  image  as  shown  in  the  top 
left  comer  of  Figure  9.  For  this  case,  we  observe  only  the  35%  of  the  total  number  of  pixels,  randomly  selected — a  realization  is 
depicted  in  the  top  middle  plot  in  Figure  9.  In  sequel,  we  fix  the  desired  rank  to  k  =  40.  The  best  rank-40  approximation  using 
SVD  is  shown  in  the  top  right  corner  of  Figure  9  where  the  full  set  of  pixels  is  observed.  Given  a  fixed  common  tolerance  and 
the  same  stopping  criteria,  Figure  9  shows  the  recovery  performance  achieved  by  a  range  of  algorithms  under  consideration  for 
10  Monte-Carlo  realizations.  We  repeat  the  same  experiment  for  the  second  image  in  Figure  10.  Here,  the  size  of  the  image  is 
256  x  256,  the  desired  rank  is  set  to  k  =  30  and  we  observe  the  33%  of  the  image  pixels.  In  constrast  to  the  image  denoising 
procedure  above,  we  measure  the  reconstruction  error  of  the  computed  solutions  with  respect  to  the  best  rank-30  approximation 
of  the  true  image.  In  both  cases,  we  note  that  MATRIX  ALPS  II  has  a  better  phase  transition  performance  as  compared  to  the 
rest  of  the  algorithms. 


XI.  Conclusions 

In  this  paper,  we  present  some  new  strategies  and  also  review  some  existing  ones  for  hard  thresholding  methods  for 
recovering  low-rank  matrices  from  dimensionality  reducing,  linear  projections.  These  methods  exploit  further  problem  structure 
in  optimization  to  reduce  computational  complexity  without  sacrificing  stability. 
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TABLE  HI 

Matrix  Completion  problem  for  m  =  700  and  n  =  1000.  ”  depicts  no  information  or  not  applicable  due  to  time 

OVERHEAD. 


Configuration 

FR 

SVP 

Inexact  ALM 

GROUSE 

m 

n 

k 

1MT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

700 

1000 

5 

0 

0.04 

34 

1.9 

■  10-4 

1.77 

23 

6.5  •  lO”3 

1.69 

- 

3.5  ■  lO"13 

0.23 

700 

1000 

5 

10-3 

0.04 

34 

4.2 

■  10-4 

1.92 

23 

CO 

PI 

o 

1 

p 

1.87 

- 

3.1  •  10~4 

0.24 

700 

1000 

30 

0 

0.239 

61 

4.6 

■  10-4 

6.39 

29 

1.2  ■  10~4 

3.91 

- 

3.2  ■  lO-3 

3.15 

700 

1000 

30 

10~a 

0.239 

61 

1.1 

■  10~a 

6.33 

29 

1  •  10~a 

3.87 

- 

8  •  10“4 

3.14 

700 

1000 

50 

0 

0.393 

95 

8.5 

■  10-4 

14.47 

49 

3.2  •  10~4 

9.02 

- 

1.3  ■  lO-3 

10.31 

700 

1000 

50 

10_a 

0.393 

95 

1.6 

■  10_a 

15.15 

49 

1.4  •  10_a 

9.11 

- 

8  •  10“4 

10.34 

700 

1000 

110 

0 

0.833 

683 

1.2 

■  10-2 

253.1 

374 

5.8  •  10_a 

152.61 

- 

1.2  •  10_i 

110.93 

700 

1000 

110 

10~a 

0.833 

682 

1.3 

■  IQ-'-1 

256.21 

374 

6.8  •  10”a 

154.34 

- 

1.05  •  10”1 

111.05 

LMaFit 

Matrix  ALPS  II 

Matrix  ALPS  II  with  QR 

m 

n 

k 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

700 

1000 

5 

0 

0.04 

24 

7.2 

•  10“° 

0.67 

8 

1.5  •  10~a 

1.15 

15 

8.3  ■  lO"5 

1.05 

700 

1000 

5 

10~a 

0.04 

17 

3.7 

■  10-4 

0.5 

10 

4.5  •  10“4 

1.38 

15 

3.8  •  10~4 

1.1 

700 

1000 

30 

0 

0.239 

34 

9.2 

■  10“° 

1.95 

14 

4.5  •  10“a 

3.69 

35 

1.1  •  10~4 

2.6 

700 

1000 

30 

10~a 

0.239 

30 

1  • 

10~a 

1.71 

25 

1.1  •  10~a 

6.1 

35 

1  •  10“a 

2.61 

700 

1000 

50 

0 

0.393 

53 

2.7 

•  10_E> 

4.59 

25 

iC 

1 

o 

CD 

00 

8.87 

57 

1.6  •  10-& 

4.47 

700 

1000 

50 

10~a 

0.393 

52 

1.4 

•  10_a 

4.53 

40 

1.6  •  10~3 

14.38 

57 

1.4  ■  10~a 

4.49 

700 

1000 

110 

0 

0.833 

584 

9- 

IP3 

101.95 

280 

8  •  lO"4 

214.93 

553 

7-  10-4 

51.72 

700 

1000 

110 

10~a 

0.833 

584 

3.7 

■  10_a 

102.15 

336 

4.7-  10“a 

261.98 

551 

CO 

o 

1 

u 

51.62 

TABLE  IV 

Matrix  Completion  problem  for  m  =  500  and  n  =  2000.  ”  depicts  no  information  or  not  applicable  due  to  time 

overhead. 


Configuration 

FR 

SVP 

Inexact  ALM 

GROUSE 

m 

n 

k 

1MT 

iter. 

err. 

time 

iter. 

err. 

time 

iter. 

err. 

time 

500 

2000 

30 

0 

0.083 

64 

5.3 

•  10-4 

10.18 

32 

1.9  ■  10“4 

6.47 

- 

1.6  ■  10“4 

2.46 

500 

2000 

30 

10~a 

0.083 

64 

1.1 

•  10_a 

6.69 

32 

1 • 10_a 

4.51 

- 

6  •  10-4 

1.94 

500 

2000 

30 

10~4 

0.083 

64 

5.4 

•  10~4 

10.14 
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Fig.  7.  Low  rank  matrix  recovery  for  the  matrix  completion  problem.  The  error  curves  are  the  median  values  across  50  Monte-Carlo 
realizations  over  each  iteration.  For  all  cases,  we  assume  p  =  0.3  nm.  (a)m  =  300,  n  =  600,  k  =  5  and  ||e||0  =  0.  (b)m  =  300,  n  =  600, 
k  =  20  and  ||e||  =  10-4. 


(a)  (b) 


(c) 


(d)  (e) 


(f) 


Fig.  8.  Low  rank  matrix  recovery  for  the  matrix  completion  problem.  The  error  curves  are  the  median  values  across  50  Monte-Carlo 
realizations  over  each  iteration.  For  all  cases,  we  assume  p  =  0.3 mn.  (a)  m  =  700,  n  =  1000,  k  =  30  and  ||e||2  =  0.  (b)  m  =  700, 
n  =  1000,  k  =  50  and  ||e(L  =  10“3.  (c)  m  =  700,  n  =  1000,  k  =  110  and  ||e|L  =  0.  (d)  m  =  500,  n  =  2000,  k  =  10  and  ||e||  =  0. 
(e)  m  =  500,  n  =  2000,  k  =  50  and  ||e|j  =  10~3.  (f)m  =  500,  n  =  2000,  k  =  80  and  ||e[L  =  10~4. 


In  theory,  constant  p,i  selection  schemes  are  accompanied  with  strong  RIP  constant  conditions  but  empirical  evidence  reveal 
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Fig.  9.  Reconstruction  performance  in  image  denoising  settings.  The  image  size  is  512  x  512  and  the  desired  rank  is  preset  to  k  =  40. 
We  observe  35%  of  the  pixels  of  the  true  image.  We  depict  the  median  reconstruction  error  with  respect  to  the  true  image  in  dB  over 
10  Monte  Carlo  realizations. 


signal  reconstruction  vulnerabilities  for  deviations  from  the  initial  problem  assumptions.  While  convergence  derivations  of 
adaptive  schemes  are  characterized  by  weaker  bounds,  the  performance  gained  by  this  choice  in  terms  of  convergence  rate, 
is  quite  significant.  Memory-based  methods  lead  to  convergence  speed  with  (almost)  no  extra  cost  on  the  complexity  of  hard 
thresholding  methods — theoretical  eveidence  prove  the  efficiency  of  memory  utilization  in  signal  recovery  but  more  theoretical 
justification  is  neeed  as  future  work.  Lastly,  further  estimate  refinement  over  sparse  support  sets  using  gradient  update  steps  or 
pseudoinversion  optimization  techniques  provides  signal  reconstruction  efficacy,  but  more  computational  power  is  needed  per 
iteration. 

Affine  rank  minimization  on  real  data  deals  with  very  large  matrices  which,  in  many  cases,  is  impossible  to  load  into 
the  Random  Access  Memory  (RAM)  of  a  computer;  therefore,  even  first-order  gradient  descent  procedures  are  prohibitively 
expensive  and  require  huge  processing  power  and  memory  storage  restricting  the  application  of  these  algorithms  only  on  small¬ 
sized  matrices.  Recent  developments  on  geometric  functional  analysis  have  shown  encouraging  results  dictating  that  sampling 
from  large  matrices  can  approximate  efficiently  large  data  sets  with  small  error  in  terms  of  the  Frobenius  norm.  In  this  work, 
we  connect  e-approximation  low-rank  revealing  schemes  with  first-order  gradient  descent  algorithms  to  solve  general  affine  rank 
minimization  problems — to  the  best  of  our  knowledge,  this  is  the  first  attempt  to  theoretically  characterize  the  performance  of 
iterative  greedy  algorithms  with  e-approximation  schemes. 
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Fig.  10.  Reconstruction  performance  in  image  denoising  settings.  The  image  size  is  256  x  256  and  the  desired  rank  is  preset  to  k  =  30. 
We  observe  33%  of  the  pixels  of  the  best  rank-30  approximation  of  the  image.  We  depict  the  median  reconstruction  with  respect  to  the 
best  rank-30  approximation  in  dB  over  10  Monte  Carlo  realizations 


In  all  cases,  experimental  results  illustrate  the  effectiveness  of  the  proposed  schemes  on  different  problem  configurations. 
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Appendix 


A.  Proof  of  Lemma  6 

Given  A*  :=  ortho(X*),  we  define  the  following  quantities:  Si  :=  A,  U  Vi,  S*  :=  A,  U  A*.  Then: 


7,5,\5*  —  V'Di\(x”\jxi),  and  Vs*\Si  —  'Px*\(Diuxi)- 


(57) 
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Since  the  subspace  defined  in  T>i  is  the  best  rank-fc  subspace,  orthogonal  to  the  subspace  spanned  by  Xt,  the  following  holds 
true: 


||^A*4V/(X(t))||“  >  ||P*.\*4V/(X(t))||’  =>  (58) 

\\VXyf(X(i))\\2F  +  ||^4^4V/(X(t))||“  >  ||7>*4V/(X(t))|£  +  ||^,v^V/(X(i))||^  =*  (59) 

||Ps4V/(X(t))||’  >  \\Ps?Vf(X{x))\\2F  (60) 

Removing  the  common  subspaces  in  <S,  and  S* ,  we  get 

||p5A5.v/(X(*))||2f  >  ||psnSiv/(X(i))||2F  =>  (6i) 

\\VSi\s*A*A(X*  -  X(i))  +  PS.\S,A*£\\F  >  \\Vs*\stA*A(X*  -  X(i))  +Vst\SiA*e\\F  (62) 

On  the  left  hand  side,  we  have: 

|| VSi\s;A*A(X*  -  X{i))  +  VSi\s*A*e ||F  (63) 

<  || PSi\stA*A(X*  -  X(t))||F  +  ||iP5A5.^l*£||F  (64) 

(=  ||P5i\5*(^*  -  X(t))  +VSi\S;A*A(X*  -  X(i))||F  +  \\PSi\s: A*e\\F  (65) 

(‘=  ||(/  -  Pst\s;A*AVSi\s;)(X*  ~  X(t))  +  rSi\s;  A*  AP£iXS*  (X*  -  X(t))||F  +  ||P5iX5.  A*e\\F  (66) 

<  ||(/  -  VSi\stA*AVSi\sf){X *  -  X(i))||F  +  ||  VSi\stA*  APixst(X*  -  X(i))||F  +  ||pSi\s*^e||F  (67) 

(iv) 

<  <53fe(A)j|X*  -  X(i)|[*  +  || VSi\stA*APii\s:(X*  -  X(i))||F  +  ||Ps4\s4-A*e||F  (68) 

<  <53fe(A)||X*  -  x(*)||F  +  «3fc(X)||^4W(X*  -  x(t))||F  +  ||p5AsrA*£llF  (69) 

( vi ) 

<  253k(A)\\X*  —  X(*)||F  +  ||755i\5* A*e||F  (70) 


where  (i)  due  to  triangle  inequality  over  Frobenius  metric  norm,  (ii)  since  Psi\sf(X(i)  —  X*)  =  0,  (Hi)  by  using  the  fact 
that  X(i)  —  X*  :=  7>si\sf(X(i)  —  X*)  +  Vg.\S*(X(i)  —  X*),  (iv)  due  to  Lemma  3,  (v)  due  to  Lemma  4  and  (vi)  since 

||Pi:\5.(X*-X(i))|iF<  ||X(i)-X*||F.  ‘ 

For  the  right  hand  side  of  (62),  we  calculate: 

\\Ts:\S,A*A(X*  -  X(i))  +Vs*\si  A*4F  (71) 

=  ||Ps,*vs4.A*.A(X*  -X(f))  +  )Ps.\5i(X*-X(f))-P5.XSi(X*-X(f))  +  )Ps.\5i^e||F 
=  || (P s* \Si  A? AP s? \Si  -  I)(X*  -  X(i))  +  Vs*\SlA*AP^\Sz(X*  -  X(i)) 

+  'Ps^\si(X*  —  X(i ))  +  'Ps»\S;«4.*e||f  (72) 

>  ||Ps*\S,(X*  -  X(*))||F  -  ||(^.\54^*^54*\34  -  I)(X*  -  X(t))||F 

-  || Ps;\SiA*AP£.\Si(X*  -  X(i))||F  -  ||73sf \5;»4-*e||F  (73) 

>  ||P5*\5,(X*  -  X(*))||F  -  252k(A)\\X(i)  -  X*  ||F  -  \\Ts:\st  A*e\\F  (74) 

by  using  Lemmas  3  and  4.  Combining  (70)  and  (74)  in  (62),  we  get: 

\\'Px*\(i>iuxi)X  ||F  <  (2S2k(A)  +  2<53fc(«4.))||X(i)  —  X  ||F  +  ||’P(5*\5i)u(5i\-sf )A  e||F  =>  (75) 

||7:,A'*\5i X  ||F  <  ( 282k(A )  +  253k(A)) || X(i)  —  X  ||F  +  \/2(l  +  82k («4.)) ||s|| 2-  (76) 


B.  Proof  of  Theorem  1 

Let  X*  «—  ortho(X*)  be  a  set  of  orthonormal,  rank-1  matrices  that  span  the  range  of  X* .  In  Algorithm  1,  W(i)  is  the  best 
rank-fc  approximation  of  V(i).  Thus: 

\\W(x)-V(i)\\2F<\\X* -V(i)\\2F^  (77) 

||VF(i)  -  X*  +  X*  -  V(i)||*  <  ||x*  -  V(t)||*  =►  (78) 

\\W(i)-X*\\2F  +  \\V(i)-X*\\2F  +  2(W(i)~X*,X* -V(i))  <  ||X*- V(*)||*  =►  (79) 

||W(0  -  X*\\aF  <  2(W(i)  -  X*,  V(i)  -  X*)  (80) 
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From  Algorithm  1,  it  is  obvious  that  i)  V{i)  £  span(<S,),  ii)  X{i)  £  span(iSi)  and  Hi)  W(i)  £  span(<Sj).  We  define 
£  :=  Si  U  X*  where  rank(span(£))  <  3 k  and  let  Vs  be  the  orthogonal  projection  onto  the  subspace  defined  by  £. 

Since  W(i)  —  X*  £  span(£)  and  V(i)  —  X *  £  span(£),  the  following  hold  true: 

W{i)~  X*  =V£{W{i)- X*)  and  V{i)  -  X”  =  Vs{V{i)  -  X*)  (81) 

due  to  Remark  3. 

Then,  (80)  can  be  written  as: 

\\W(i)  -  X*\\l  <2(Ve(W(i)  -  X*),V£(V(i)  -  X*))  =>  (82) 

II w(i)  -  X*\\l  <  2 (V£(W(i)  -  X*),V£  (X(i)  +  HiVSi A* A(X*  -  X(i))  +  mVSiA*e  -  X *)>  =*  (83) 

||W(i)  -  X*||“  <  2 (V£(W(i)  -  X*),V£{X{i)  -  X*  ~  iiiySiAmA{X(i)  -  X*))) 

=A 

+  2pLi(Ve(W(i)-X*),VeVsi(A*e))  (84) 

=B 

In  B,  we  observe: 

B  :=  2 Pi{V£{W{i)  ~  X*),VeVSi{A*e))  =  2fH{W{i)  -  X*,VSi{A*e))  (85) 

(a) 

<  2fM\\W(i)  -  X*\\F\\VsSA*e)\\p  (86) 

(<°  2niS/l  +  52k(A)\\W(i)  -  X*||F||e||2  (87) 

where  (i)  holds  due  to  Remark  3  and  since  Vs{V£  =  VsVsi  =  Vst  for  Si  C  £,  {ii)  is  due  to  Cauchy-Schwarz  inequality 
and,  {in)  is  easily  derived  using  Lemma  1. 

In  A,  we  perform  the  following  motions: 

A  :  =  2 (Vs(W(i)  -  X*),Vs(X(i)  -  X*  -  fSiVs,  A*A(X{i)  -  X*)))  (88) 

=  2 (W(t)  -  X\V£{X{i)  -  X *)  -  mVeVSiA*AVe{X{i)  -  X *)>  (89) 

=  2 (W{i)  -  X*,V£{X{i)  -  X *)  -  (j,iVSiA*  AVs{X(i)  -  X*)}  (90) 

=  2(W(i)  -  X*,Ve{X{i)  -  X*)  -  ^Vs,A*A[VSt  +  V^]V£{X{{)  -  X*)}  (91) 

=  2 (W{i)  -  X*,  (7  -  mVSi  A*AVsz)Vs(X(i)  -  X*)}  -  2 im{W(i)  -  X*,  Vs,  A*  APs,V£{X{i)  -  X*))  (92) 

(a) 

<  2||  W(i)  -  X*||  ||(J  -  imVsi  A*AVSi)Ve{X{i)  -  X*)||F  +  2lM\\W{i)  -  X*||J  VSiA*  AV^Vs{X{i)  -  X*)|| . 


where  (i)  is  due  to  Vs(X(i)  —  X*)  :=  VsiVs(X(i)  —  X*)  +  VgiV£{X{i)  —  X*)  and  {ii)  follows  from  Cauchy-Schwarz 
inequality.  Since  1+s^k^  <  /r,  <  ’  Lemma  3  implies: 


\{I  -  lHVSiA* AVSi)  6  1- 


and  thus: 


1  —  <^2  k{A)  1  +  52k(A) 
1  +  $2k{A)  1  —  52k{A) 


{I  -  iHVSiA* AVSi)Ve{X{i)  -  X*)||F  <  ^^A) 


-1  < 


252k  {A) 

1  —  52k  ( A) 


||7MX(i)-x* 


Furthermore,  according  to  Lemma  4: 


|| VSiA* APiVe{X{i)  -  X*)||F  <  53k(A)\\ViV£{X{i)  -  X*)||F  (96) 

since  rankjPfus.X)  <  3k,  VX  €  TZmXn.  Since  V£.Vs{X{i)  -  X*)  =  Vx.\(piUXi)X*  where  V,  :=  Vk  (V^Vf{X (*))), 
then: 

\\ViVe{X{i)  -  X*)||F  =  \\Vx*\{-DiUXi)X*\\F  <  (2 52k(A)  +  2J3fc(^))||X(t)  -  X*||F  +  V2(l  +  <52fc(^t))||e||2,  (97) 
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using  Lemma  6.  Combining  the  above  in  (93),  we  compute: 

|J(/  -  mVsiA*  AVSi)Ve(X{i)  -  X * 


A  <  2|| W(i)-X 
4<52*(A) 


,  +  2im\\ W(i)  -  X*\\F\\PsiA‘APiVe{X{i)  -  X * 


< 


< 


1  —  <52*  (A) 
4<52*(A) 


|W(*)-X*||Fj|X(i)-X‘||F  + 


2(53*;  (A) 


—  <52*  (A) 


+  (2<52*(A)  +  2<53*(A)) 


F  1  —  fe*  (A) 
2(53*  (A) 


\ViVe{X{i)  -  X*)\\r\\ W{i)-X* 


1  —  <52*  (A) 

elL 


W(t)-X*||J|X(t)-X*| 


(98) 

(99) 

(100) 


X* 


(i 

Combining  (87)  and  (100)  in  (84),  we  get: 

(rSsij +  (2fa(-4)  +  - 

Focusing  on  steps  5  and  6  of  Algorithm  1,  we  perform  the  following  motions: 

||X(i  + 1)  -  x*||F  =  || W(t)  +  iiVWiA*A{X*  -  W(i))  +  ^rWiA*e  -  x*\\F  (102) 

=  \\{l-iiVWiA*A){W{i)-X*)+iiVWiA*e\\F 

=  ||  (I  -  ^iVwiA* APwi)  (W(i)  -  X*)  +£iVvViA*APhi{X*  -  W(i))+V^(x*  ~  W(i))  +  (i?w,  A*e||F 

<  \\{I  -  iiVwtA* AVw%){W(i)  -  X*)\\F  +  tiWVwtA* AV^z{X*  -W{i))\\F 
+  WV^iWii)  -  X*)||F  +^||iPWlA*e||F 

<  ll^tv«(w(0  -  **)lk  +  MA)6||p^(w(i)  -  X*)\\F  + 11^4 (W(t)  -  X*)\\F  +di\\vWiA*s\\F 
^A)11^^  ~  X*)I|f  +(1+  1-5AA))  l|P^(W(i)  -  X*^F  +  6||^WiA*e||F 


1  —  <5*  (A) 

<”>  /  1  +  2^2*  (A) 
1  —  <52*  (A) 


||w(0-**| 


\/l  +  5*  (A)  | 
1  —  <5*  (A) 


(103) 


where  (i)  is  due  to  Lemmas  3  and  4  and  (ii)  is  due  to  Remark  6  and  <5a*(A)  <  <5,9*  (A)  for  a  <  ft,  a,  (5  £  Z+.  Combining 
the  recursions  in  (101)  and  (103),  we  finally  compute: 


\X(i  +  l)-X* 


\F< 


C (t^53)  +  +  )»x«  - 


—  <52*  (A)  - 


X* 


+ 


y  \  1  —  <52*  (A)  /  V  1  —  <52*  (A)  1  —  <52*  (A)  '  1  —  <5*  (A) 


I2' 

(104) 


For  the  convergence  parameter  p,  further  compute: 

"  1  +  2(52* ( A)  \  /  4(52* (A) 


(l  +  2d2k{A)\(  40 2* (A)  rriS  ,  ns.  ,  2d3k{A)  \  1  +  2o3k(A)  (  os2  ,  _  „  /1AS, 

l  1  -  52* (A)  Hi-  52* (A)  +  (2<MA)  +  25sk(A))  1  -  <52* (A) )  -  (1  —  <53*(A))2  (4<MA)  +  8<53*(A))  -  P-  (105) 

for  <5* (A)  <  J2* (A)  <  <53* (A).  Calculating  the  roots  of  this  expression,  we  easily  observe  that  p  <  1  for  <53*(A)  <  0.1235. 


C.  Proof  of  Theorem  2 

Before  we  present  the  proof  of  Theorem  2,  we  list  a  series  of  lemmas  that  correspond  to  the  motions  Algorithm  2  performs — 
detailed  proofs  of  the  following  results  can  be  found  in  the  Appendix. 

Lemma  10.  [ Error  norm  reduction  via  least-squares  optimization]  Let  St  be  a  set  of  orthonormal,  rank-1  matrices  that  span 
a  rank-2k  subspace  in  lZmxn.  Then,  the  least  squares  solution  V(i)  given  by: 

V (f)  4 —  argmin  ||y  —  AV]^,  (106) 

V:VEspan(Si) 
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satisfies: 


||V(t)-X*||F< 


0 -<&(•*) 


II  Vi{V{i)-X* 


+ 


\/l  +  <^2 k(A)  ,|  || 
1-<WA)  l|e|l2‘ 


(107) 


Proof:  We  observe  that  ||V(t)  —  X*||F  is  decomposed  as  follows: 

I|v(i)  -  x‘||2F  =  II vSi(Y(i)  -  x*)||F  +  II vi(v(i)  -  x*)||F.  (108) 

In  (106),  V(i)  is  the  minimizer  over  the  low-rank  subspace  spanned  by  Si  with  rank(span(5i))  <  2k.  Using  the  optimality 
condition  (Lemma  5)  over  the  convex  set  0  =  {X  :  span(X)  £  S;},  we  have: 

(Vf{V(i)),PSi  (X*  -  V(i))>  >  0  =>  (AV(i)  -  y ,  APSl  (V(x)  -  X*))  <  0.  (109) 

for  VsiX*  £  span(5i).  Given  condition  (109),  the  first  term  on  the  right  hand  side  of  (108)  becomes: 

II VSi{V{i)  -  x*)||F  =  (V(i)  -  X*,VsAV(i )  -  x*)}  (110) 

<  (V(i)  -  X\VSi(V{i)  -  X*)}  -  (AV(i)  -  y,AVSi(V(i)  -  X*))  (111) 

=  (V(i)  -  X*,VSi(V(i )  -  X*)}  -  (AV(i)  -  AX*  -  e,AVSi(V(i)  -  X*))  (112) 

=  (V(i)  -  X*,VSi(V(i)  -  x*)}  -  (V(i)  -  X*,  A*  APSi  (V(0  -  X*)}  +  (e,  APSt  ( V(i )  -  X*)> 

(113) 

=  (V(i)  -  X*,(I  -  A* A)Vsi(V(i)  -  X*))  +  ( e,APsz(V(i )  -  X*))  (114) 

<  |(V(i)  -  X*,  (I  -  A*A)VSi(V(i )  -  X*)>|  +  < e,APSi(V(i )  -  X*)}  (115) 

Focusing  on  the  term  |(V(i)  —  X*,  (I  —  A*  A)Vsi  (V(i)  —  X*))|,  we  derive  the  following: 

|<V(i)  -  X*,  (7  -  A*A)PSi(V(i )  -  X*))|  (116) 

=  |(V(i)  -  X*,VSi(V(i)  -  X*)}  -  (V(t)  -  X*,A*AVSi{V(i)  -  X*))|  (117) 

=  \{Pstu x*(V(i)  -  X*),VSi(V(i)  -  X*)}  -  (APSiux*{V{i)  -  X*),AVSi(V{i)  -  X*))|  (118) 

(=  \(VSiux*(V(i)  -  X*),PSiux-PSi(V(i)  -  X*)) 

-  (APsiUX*  (V(i)  -  X*),  AVs^x*  Pst{V(i)  -X*))|  (119) 

=  \(VSiux*  ( V(i )  -  X*),  (/  -  A*  APSiux-)PSi{V{i)  -  X*))|  (120) 

=  |(V(i)  -  X*,  (7  -  PSiux*A* APSiux* )VSi(y(i)  -  X*))|  (121) 

where  ( i )  follows  from  the  facts  that  V(i)  —  X*  £  span(5;  U  X*)  and  thus  VsiUX*  (V(i)  —  X*)  =  V(i)  —  X*  and  (ii)  is 
due  to  VsiUX*Vsi  =  Pst  since  span(<S,)  C  span(<Si  U  X*).  Then,  (115)  becomes: 

||7>Sl(V(i)-X*)||*  (122) 

<  |(V(t)  -  X*,  (7  -  Vsiux*A*AVsiux*)Vs,(V(i)  -  X*)>|  +  (e,APSi(V(i)  -  X*))  (123) 

<  ||V(i)  -  X*\\F\\(I  -  Vsiux*A*APsiux*)Vsi(V(i)  -  X*)\\F  +  \\VSi  A*  e\\F\\PSi(V(i)  -  X*)||F  (124) 

(ii)  _ 

<  S3k(A)\\Vsz  (V(i)  -  X*)||F||  V(i)  -  X*||F  +  v/1  +  52k(A)\\VSz  (V(i)  -  X*)||F||e||2,  (125) 


where  (i)  comes  from  Cauchy-Swartz  inequality  and  (ii)  is  due  to  Lemmas  1  and  3.  Simplifying  the  above  quadratic  expression, 
we  obtain: 


|| VSl(V(i)  -  X*)||F  <  53k(A)\\V(i)  -  X*|fF  +  sjl +  52k(A)\ 
As  a  consequence,  (108)  can  be  upper  bounded  by: 


I2' 


||v(i)  -  x*||F  <  («3*(A)||V(i)  -  x*||F  +  y/l  +  SMA) \\e\\2)2  +  II Pi(V(i)  -  X* 


I2 

If* 


We  form  the  quadratic  polynomial  for  this  inequality  assuming  as  unknown  variable  the  quantity  ||V(«)  —  X* 
by  the  largest  root  of  the  resulting  polynomial,  we  get: 


||V(i)-X* 


< 


y/l-SIkW 


\\Pi(V(i)-X* 


+ 


y/l  +  52k(A)  | 

1  —  63k  (A) 


(126) 

(127) 
|  .  Bounding 

(128) 


The  following  Lemma  characterizes  how  subspace  pruning  affects  the  recovered  energy: 
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Lemma  11.  [Best  rank-k  subspace  selection]  Let  V(i)  £  Xmxn  be  a  rank-2k  proxy  matrix  in  the  subspace  spanned  by  Si 
and  let  X(i  +  1)  -t—  Vk(V(i ))  denote  the  best  rank-k  approximation  to  V{i),  according  to  (13).  Then: 

j|X(i  +  1)  -  V(i)||F  <  || Vs,(V{i)  -  **)||F  <  ||V(i)  -  X*||F.  (129) 

Proof:  Since  X(i  + 1)  denotes  the  best  rank-fc  approximation  to  V ( i ),  the  following  inequality  holds  for  any  rank-fc  matrix 
X  £  7 ZmXn  in  the  subspace  spanned  by  Si,  i.e.  VX  £  span(<Si): 

||X(i  +  l)-V(i)||F<  ||X-V(i)||F.  (130) 

Since  Vs,  V(i)  =  V(£),  the  left  inequality  in  (129)  is  satisfied  for  X  :=  VsiX*  in  (130).  The  second  inequality  in  (129)  holds 
according  to  Remark  6.  ■ 


Lemma  12.  Let  V(i)  be  the  least  squares  solution  in  Step  2  of  the  ADMiRA  algorithm  and  let  X(i  +  1)  be  a  proxy,  rank-k 
matrix  to  V(i)  according  to:  X(i  +  1)  Vk{V(i))-  Then,  |X(i  +  1)  —  X*||F  can  be  expressed  in  terms  of  the  distance 

from  V(i)  to  X*  as  follows: 


||X(»  +  1)  -  X* ||F  <  ^1  +  3sijA)\\V(i)  -  X*||F  +  ^/l  +  3^fc(A)^y++Jf  ||e||a.  (131) 

Proof:  We  observe  the  following 

||X(i  +  1)  -  X*||*  =  ||X(t  +  1)  -  V(t)  +  V(i)  -  X*||*  (132) 

=  ||(V(i)  -  X*)  -  (V(i)  -  X(i  +  1))|£  (133) 

=  ||  V(t)  -  X*||F  +  ||  V (z)  -  X(t  +  1)||*  -  2(V(i)  -  X*,  V(i)  -  X(i  +  1)).  (134) 


Focusing  on  the  right  hand  side  of  expression  (134),  (V (i)  —  X*,  V(i)  —  X(i  +  1))  =  ( V ( i )  —  X* ,Vst  (X (i)  —  X(i  +  1))) 
can  be  similarly  analysed  as  ( 1 17)-(12 1)  where  we  obtain  the  following  expression: 


l<V(t)  ^  X*,VS ,  iV(i)  -  X(i  +  1)))|  <  S3k{A)\\ V(i)  -  X*||F||  V(i)  -  X(i  +  1)||F 

+  V1  +  hk{A)\\V(i)  -  X{i  +  l)||F||e||2.  (135) 

Now,  expression  (134)  can  be  further  transformed  as: 

||x(t  +  1)  -  X*||F  =  ||V(t)  -  X*||F  +  ||V(i)  -  X(i  +  1)||F  -  2 (V(z)  -  X*,  V(i)  -  X(i  +  1))  (136) 

<  ||  V(i)  -  X*||F  +  ||  V(i)  -  X(i  +  1)||F  +  2|(V(t)  -  X*.  V(t)  -  X(t  +  1))|  (137) 

<  || V(i)  -  X*||F  +  || V(i)  -  X(i  +  1)||*  +  2(S3k(A)\\V(if  -  X*||F|| V(i)  -  X(i  +  1)||F  (138) 

+  y/l  +  52k\\V{i)  -  X(i  +  l)||F||e||2)  (139) 

where  ( i )  is  due  to  (135).  Using  Lemma  11,  we  further  have: 

||x(i  + 1)  - x*||*  <  ||V(i)  - X*||*  +  \\Vs,{V(i)  - X*)||F 

+  2(«3fc(^)||V(i)  -  X*\\F\\VSi(V(i)  -  X*)||F  +  Vi  +  52k\\VSi(V(i)  -  X*)||F||e||2)  (140) 


Furthermore,  replacing  ||'P,si(X*  —  V"(*))||F  with  its  upper  bound  defined  in  (126),  we  get: 
||X(i  +  1)  -  X*||*  =  (1  +  34(A))||  V(t)  -  X*||*  +  653fe(A)y/l  +  <52fc(A)||V(£)  -  X*| 


(0 

< 


(l  +  3523k(A))(\\V(i)-X%  +  ^ 


3(1  +  <^2fc(*4.))  | 
1  +  36%k(A) 


|2  +  3(1  +  $2fc(-4.))||£||2 
(141) 


where  (i)  is  obtained  by  completing  the  squares  and  eliminating  negative  terms. 
Applying  basic  algebra  tools  in  (131)  and  (107),  we  get: 


|X(£  +  1)  —  X* 


\F< 


1  +  3^  k(A) 
1  ~S23k(A) 


|  ^(V(t)-X* 


/  y/1  +  3S§k(A) 
'  1  —  53k(A) 


+  v/3 )  \/l  +  82k («4.)||e| 


(142) 


28 


Since  V(i)  €  span(<S;),  we  observe  Vg  (V(i)  —  X*)  =  —VgX*  =  —  Vx*\('Diuxi)X* .  Then,  using  Lemma  6,  we  obtain: 

(2 S2k(A)  +  2S3k{A)) \\X*  -  X(i)||F  +  ^2(1  +  J3fc)||e| 


X(i  +  1)-X*L< 


1  +  3^3&(A) 


V"!  +  35%k{A) 


1  —  ^3fc(A) 

=  (2^2  fc(«4.)  +  2^3fc(^l)) 


+  v 3 )  yl  +  52k (A) ||e 
1  +  3<5f  k  (A) 


l-532fe(A) 


-S23k(A) 

Given  62k  (A)  <  <53^ (A),  p  is  upper  bounded  by: 

p  <  4<53fc(A) 

Then, 

4(53fe(A) 


X*-X(*)||F 

\+-5UA)  V2{1 + *3fc)  +  (^Sir  +  ^ + S2k{A) 


1  +  3Ssk(^) 

1  -  %k(A) 


—  ^3  k{*A) 

1  +  3Ssk  (*4.) 

1  -  <53\U) 

<  1  44  S3k(A)  <  0.2267. 


(143) 


(144) 


(145) 


(146) 


D.  Proof  of  Lemma  7 

Let  T>1  :=  ortho (t>%(J-,xX f{X{i)))^  and  27,  :=  ortho (Vk {V*.  Vf(X(i)))J .  Using  Definition  4,  the  following  holds  true: 


II PvtVf{X(i))  -  V/(X(t))|£  <  (1  +  e)|| rViVf(X(i))  -  V/(X(i)) ||F  (147) 

Furthermore,  we  observe: 

||V/(X(t))||^=  ||V/(X(t))||^  (148) 

\\Vv!Vf(X(i))+V^Vf(X(i))\\2F=  \\Vx.\XiVf{X{i))+v£.\xyf(X(i))\\l*>  (149) 

||^V/(X(0)||“  +  ||^.V/(X(t))||“  =  ||^.\*4V/(X(i))||“  +  ||^.^V/(X(i))||*  (150) 

Since  V-ofS f(X(i))  is  the  best  rank-A;  approximation  to  V f(X(i)),  we  have: 

\\Vvyf(X(i))  -  V/(X(i))||2F  <  \\px,\Xivf{x{i))  -  V/(X(t)) ||F  44  (151) 

||p4v/(X(i))||2F  <  ||^.N*(V/(X(i))||*  (152) 

(l  +  e)||^V/(X(i))||2F  <  (l  +  e)||^.\*jV/(X(t))||“  (153) 

where  rank(span(A'*  \  X, ) )  <  k.  Using  (147)  in  (153),  the  following  series  of  inequalities  are  observed: 

\\Ht V/(X(t))||*  <  (1  +  e)||7?-Di V/(X(*))||F  <  (1  +  e)||^.\^ V/(X(t))||*  (154) 

Now,  in  (150),  we  have: 


||T’-d? v/(x(())||F  +  H'Pb? v/(x(*))||F  =  ||^,^V/(X(*))||F  +  ||^.\^V/(X(*))||F  (143) 
H^V/(X(t))||*  +  (1  +  e)||^,^V/(X(i))||2F  >  \\Vx*\xXf(X(i))\\2F  +  \\V^\xXf(X(if)\\l  44 
ll^i v/(x(t))||“  +  4r**\Xi v/(X(i))||2F  >  \\vx*\Xxf(x(i))\\2F  44 
||TVV/(X(i))||2F  +  \\PXiVf{X(i))\\2F  +  e||^.^V/(X(i))||2F  >  \\Px*\xXf{X(i))\\2F  +  ||7J*iV/(X(t))||’  & 
\\Vsyf(X({))\\2F+e\\V^\xXf(X(i))\\2F  >  \\Vs*Vf(X(i))\\2F  (# 

||PSi\5rV/(X(i))|^  +  e||P^^V/(X(i))||2F>  ||Ps.\5iV/(X(i))||2F44 
II VSi\stA*{y  -  AX(i))W2F  +  e|| V^\XzA\v  -  -4X(i))||F  >  || Vst\s,A*(y  -  AX(i))\\2F  44 
|| VSi\stA*{y  -  XX(i))|[F  +  x/i|| V^\xA*(y  -  AX(i))||F  >  \\rs:^A*{y  -  AX(t)) 


(155) 
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Focusing  on  A*(y-AX(i))  ||F,  we  observe: 

'[\rPx*\xiA*  (y  —  AX(i)) ||F  =  | Vx*\XiA*  [AX  +  e  —  AX  (j))||F  =  || Vx*\XiA'  A{X*  —  X(i))  +  Px*\Xi A  e||F 
=  || 'Px*\xiA*  AVx*\Xi  (X*  —  X(i))  +  Px*\Xi  A  APXi  (X*  —  X(i))  +  Vx*\xA*e ||F 

=  || rx*\x,  A*APx*\xt(X*  -  X(i))  +  VXiA*AVXi(X *  -  X(i))  +  r^UXiAAPx,(X*  -  X(i))  +  V^\XiAe\\ 
<  || V^\XtAAPx*\xdX*  -  X(i))\-.r  +  || VXiA*APXi(X*  -  X(i))||F 
+  W'P x-uXi  A ' APXi(X*  —  X(i))||F  +  1 1 A  e 1 1 F 


where  ( i )  is  due  to  Vx,\Xix  =  [PXi  +  VX-UX .]X,  \/X  €  7 Zn 
In  (156),  we  further  observe: 

•  II  r£.\XiA*APx.\Xi(X*-X(i),llF. 


(156) 


<  82k(A)\\X*  —  X(()||  since  rank(span((A’*\ 


<82k(A)\\Vx*\Xi(X*-X(i)) 

Xi)  UXiU  X*))  <  2k  and  \\Px.\Xi{X*  -  X(i))||F  <  \\X*  -  X(i)| 

"VXiA*APXi(X*  -  X{i))\\F  <  (l  +  5k(A))\\VXl(X(i)-X*)\\F  <  (1  +  4(A))||X(i)  -  X 
P^uxA*APXi(X*-X(i))\\F  <  ||)P^^l*^i(X*-X(i))||F  <  52k(A)\\rXi(X(i)-X*)\\F  <  82k(A)\\X(i)- 


If' 


Then,  (156)  becomes: 

A  ( V  —  AX(i))||F  <  (1  +  2S2k(A)  +  5fe(«4.))||-X’(i)  —  X  ||F  +  ||7:’;t*\;t'i  A  e||F 

Moreover,  we  know  the  following  hold  true  from  Lemma  6: 

|| VSi\s*A*A{X*  -  X{i))  +  VSi\S*A*e\\p  <  253k(A)\\X *  -  X(t)||F  +  \\VSi\s;  A*e\\p 

and 

\\rs*\s,  A’A(X*  -  X(i))  +Vs*^A*e\\F  >  ||pS;\51(-X:*  -  X(i))||F  -  252k(A)\\X(i)  -  X*\\F  -  \\Vst\SiA* 


(157) 

(158) 


IIf 

(159) 


Combining  (157)-(159)  in  (155),  we  obtain: 


11^2 


\S 


X*\\F^\\Vx  .\SiX 


$  ( 282k(A )  +  253k(A)  +  s/e(k  +  282k{A)  +  <5&(A))  |X(i)  —  X 
+  \/2(l  +  <52fc(A))||s||2  +  \/i||'P*»\.V;  A*e||F. 


(160) 


E.  Proof  of  Theorem  4 

To  prove  Theorem  4,  we  combine  the  following  series  of  lemmas  for  each  step  of  Algorithm  1 . 


Lemma  13.  [Error  norm  reduction  via  gradient  descent]  Let  Si  :=  X.t  U  T>\  be  a  set  of  orthonormal,  rank-1  matrices  that  span 
a  rank-2k  subspace  in  lZmxn.  Then: 


||v(*)-jr||F< 


i  + 


83k(A) 


+ 


('  1  1  -  <hk(A) 


^  (252k(A)  +  253k(A)  +  s/e(l  +  282k(A)  +  5fc(A))^  +  y 

1  +  82k  (A 
—  82k(A) 


2.82k  {A) 


—  82k  { A ) 


+(i+i-i^A)^v^A^ 


F  ’ 


\X(i)  -X*\ 


(161) 


Proof:  We  observe  the  following: 

||v(t)  -  x*||*  =  \\VSi{V(i)  -  X*)||“  +  || Vi(V(i)  -  x*)||F 

The  following  equations  hold  true: 

||^(V(i)  -  X*)\\l  =  II Vix%  =  ||^.\5lX*||2F  =  \\Vx*\CDfuXi)(X{i)  -  x*)\\aF 


(162) 

(163) 
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Furthermore,  we  compute: 

II VsAVii)  -  x*)||F  =  \\ps, (X(t)  -  ^ rsyf(x(i ))  -  X*) 


=  || VSi(X(i)  -  X*)  -  Hi'PsiA*A(X(i)  -X*)+  HiVsiA'e ||F 

=  ||  VSi{X(i)  -  X *)  -  HiPsi  A*  APsi  (X  (i)  -  X*)  -  ^Vs,A*AVsl(x(.i)  -  X *)  +  mVStA*e  || 
<  \\(I  -  ^Vs,A*AVs,VsAX(i)  -  X*)||F  +ni\\VsiA*AVi(X{i)  -  X*)||F  +  ^\\PSl  A*e\\p 


W  2 52k(A) 


< 


II VsAx{i)  -  X*)||F  +  \\r£.(X(i)  -  x*)||F  +  ^ 


1  —  S2  k(A) 


\/l  +  52k(A) 


{A) 


,(A) 


(164) 


where  ( i )  is  due  to  Lemmas  1,  3,  4  and  ,  ,  1  ,  . ,  <  u;  <  , — , 1  ,  . 

'  ’  l+°2 k(-A-)  —  —  1  -32fc(-A) 

Using  the  subadditivity  property  of  the  square  root  in  (162),  we  have: 

||v(i)  -  x* ||F  <  || vSi{v(i)  -  x*)||F  + 1| vi{v(i)  -  x*)||F 

<  .  2SMA\.  II Vs,(X(i)  -  x*)||F  +  -  SJks^A)  11^5, (X{i)  -  x * 


1  -S2k(A)' 
\/i  +  $2k(A ) 


,  +  \\vi(v(i)-x* 


(ii) 

< 


l-S2k(A)  ll~"2 

(l  +  x  —  s' k(A) )  +  '2S3k{A)  +  \/e(  1  +  2S2k(A)  +  5k(A))^j  +  ^ ^ ^ 

+  [{1+J^_)VWT^+^^} 


—  52k{A ) ' 

+  (1+1  5-it(A)^r**\x'A*£^ 


\x(i)  -x* 


(165) 


<  ||X(i)  -  X* 


where  (i)  is  due  to  (164)  and  (ii)  is  due  to  Lemma  7  and  \\Vsi(X(i)  —  X* 

We  exploit  Lemma  8  to  obtain  the  following  inequalities: 

||  Wi  -  X*  ||F  =  \\Wi  -  V(i)  +  V(i)  -  X*  ||F  <  ||  Wi  -  V(i)||F  +  ||  V(i)  -  X*||F 

<  (1  +  e)||  W(i)  -  V(i)||F  +  ||V(i)  -  X*||F 

<  (2  +  e)||V(i)-X*|fF  (166) 

where  the  last  inequality  holds  since  W(i)  is  the  best  rank-fe  matrix  estimate  of  V (i)  and,  thus,  ||W(i)  —  V(*)||F  <  I|V(0- 

x*  L. 


Furthermore,  Step  7  in  Matrix  ALPS  I  satisfies  the  following: 

||x(i  + 1)  -  x*  ||F  =  ||iP^(x(i  + 1)  -  x*)||F  +  \\r~(x(i  + 1)  -  x*)||F  => 

||x(i  + 1) - x* ||F  <  || r^(x(i  + 1) - x*)||F  + 1| r~(x(i  + 1) - x*)||F 

In  (165),  we  observe  \\V^(X(i  +  1)  -  X*)||F  =  ||  pi(W,  -  X*)|[F  and 
||^(X(t+l)-X*)||F 

=  || +  etV^A*A(X*  -  Wi)  +  ZiV^Xe  -  X*)||F 

=  W-P^iWi  -  X*)  -  iiP^XAV^(Wi  -  X*)  -  itV^A*AP^(W(i)  -  X*)  +^V^A*e\ 
<  ||(/  -  Zi'PwiA*APfi)Vft(Wi  -  X*)||F  +  t4Vft.A*AP&(Wi  -  X*)||F  +  ^\\P^Afe\\F 


(167) 

(168) 


s  -  *‘>||,  +  -  x-) 


,  + 


\J  1  +  8k(A) 

1  —  Sk(  A) 


2Sk(A)  S2k(A) 


Wi-X* 


\/l  +  8k(A) 


A-Sk(A)  1  -6k(A)J"  ‘  llF  1  1  -Sk(A)  11  112 

since  1+^(-A)  <  £i  <  1_^(iA)  and  both  ||'P^.(Wi-X*)||F  and  \\P~XW X*)\\F  are  less  than  or  equal  to  ||  W,  -X 


(169) 


If' 
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Using  the  above  in  (168),  we  get: 

||X(i  +  l)-X*||  <  (1+  12<yL  +  /2fcA(,fiiv 

11  wf  \  i-«5  k(A)  1  -4(A)' 

Furthermore,  combining  (170),  (166)  and  (165),  we  obtain: 


Wi-X* 


\/ 1  +  4  (A) 
1  —  4(A) 


(170) 


|X(i  +  1)  -  X* 


If< 


('  +  1  -’feW) )  (4i“{A>  +  '/'l1  +  “sit-4)))  + 


24  fc(A) 


—  4fc(A) 


1  —  <52*;  (A) 


|X(i)  -X* 


+  (‘  +  7  + «)  [(' +  OTT>  + 


—  4fc(A) 


—  <52*;  (A) 


e  2 


+  +  r^>  <2 + «>  ('  + 


-4(A) 
since  4(A)  <  4fc(A)  <  4fc(A). 


(171) 


/•.  Proof  of  Lemma  9 

Using  ineq.  (139)  and  Lemma  8,  we  compute  the  following: 

||X(i  +  1)  -  X*||2f  <  ||V(i)  -  X*  j|F  +  (1  +  e)|| TSi{V(i)  ~  X*)||f 

+  253k{A)\\V{i)  -  X*\\FVT+-e\\Vs,(V(i)  -  X*)||F 
+  2vTTev/l  +  82k(A)\\VSi(V(i)  -  X*)||F||e||2  (172) 

Furthermore,  from  (126),  we  further  obtain  in  (172): 

||x(i  +  1)  -  X* ||F  <  ||V(i)  -  X*||F  +  (1  +  e)(4*;(A)|| V(i)  -  X*||F  +  ^1  +  4k(A)||£||2)2 

+  2<53fc(A)||  V(i)  -  X*||FvT+4(4k(A)||  V(i)  -  X*||F  +  y/l  +  fouiA) ||e||2) 

+  2\/l  +  e\/l  +  4fc(  A)  (4fe(A)  ||  V(i)  —  X  || F  +  \]l  +  4  k  (A)  ||e||2)  ||e||2  (173) 

—  (l  +  ((1  +  e)  +  2\/l  +  e)<5| fc(A))  ||  V(i)  —  X  ||F 
+  ((1  +  e)  +  2-\/l  +  e)24fe(A)\/l  +  4k(A)|| V(i)  —  X  |F ||e||2 
+  ((1  +  e)  +  2-\/l  +  e)  (1  +  4k(A))||e||2  (174) 

Completing  the  squares  in  (174)  and  eliminating  some  negative  terms,  we  obtain: 


|X(i  +  1)  —  X*||  F  <  (1+  ((l  +  e)  +  2VTT^)424A))(  ||V(i)-X* 


((1  +  e)  +  2 4 1  +  e)  (l  +  4* (A)) 
1  +  ((1  +  e)  +  2yT+  e)<5|fc(A) 


(175) 


Computing  the  square  root  on  both  sides  of  the  above  expression,  we  obtain  the  desired  inequality. 


G.  Proof  of  Lemma  3 

Let  X*  <—  ortho(X*)  be  a  set  of  orthonormal,  rank-1  matrices  that  span  the  range  of  X*.  In  Algorithm  3,  X(i  +  1)  is  the 
best  rank-A:  approximation  of  V(i).  Thus: 

||x(i  +  1)  -  V(i)||*  <  ll-y*  -  V(t)||“  =>  (176) 

||X(i  +  1)  -  X*  +  X*  -  V(i)||*  <  \\X*  -  V(t) ||F  =>  (177) 

||X(i  +  1)  -  X*||F  +  ||  V(t)  -  X*||F  +  2(X(i  +  1)  -  X*,  X*  -  V(i)>  <  ||X*  -  V(t)||®  =>  (178) 

||X(i  +  1)  -  X*||F  <  2(X(t  +  1)  -  X‘,  V(i)  -  X*)  (179) 

From  Algorithm  3,  it  is  obvious  that  i)  V(i)  £  span(<Si),  ii )  Q  £  span(<S;)  and  in)  W(i)  £  span(5;).  We  define 
£  :=  Si  U  X*  where  rank(span(£))  <  4fc  and  let  Vs  be  the  orthogonal  projection  onto  the  subspace  defined  by  £. 

Since  X(i  +  1)  —  X*  £  span(£)  and  V(i)  —  X*  £  span(£),  the  following  hold  true: 

X(i  +  l)-X*=P£(X(i  +  l)-X*)  and  V(i)  -  X*  =  Vs{V(i)  -  X*)  (180) 


due  to  Remark  3. 
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Then,  (179)  can  be  written  as: 

|x(*  +  i)-x*||F  (181) 

<  2 (Ve(X{i  +  1)  -  X*),V£(V(i)  -  X*))  (182) 

=  2 (Ve(X{i  +  1)  -  X*),Vs  (Qi  +  fiiPsi  A*  A(X*  -  Q<)  -  X*)}  (183) 

=  2 {Vs(X{i  +  1)  -  X*),Ve(Qi  ~  X *  -  iMVsiA*A(Qi  -  X*)))  (184) 

=  2 (X(i  +  1)  -  X*,Ve (Qi  -  X*)  -  HiVsV Si  A*  APe ( Q ;  -  X*)}  (185) 

=  2(X(i  +  1)  -  X*,Ve(Qi  -  X*)  -  iuVsiA* APeiQi  -  X*)}  (186) 

=  2 (X(i  +  1)  -  X* ,Ve{Qi  -  X*)  -  mVsiA* A[Psi  +  v£i\Ve(Qi  -  X*))  (187) 

=  2(X(i  +  1)  -  X*,  (/  -  HiVSi A* AVSi )Ve (Qt  -  X*))  -  2 m{X(i  +  1)  -  X* ,Vst  A* APs, Ve{Qi  -  X*)>  (188) 


(ii) 

<  2||X(t  +  1)  -  X*y|(7  -  mVs^APsJVeiQ,  -  X*)||F  +  2Mi|| X(i  +  1)  -  X*||F|| VS%A* AP^VsiQ,  -  X*)||F 

(189) 


where  (i)  is  due  to  Ve{Ql  —  X*)  :=  7 ysi'P£(Qi  —  X*)+Vsi'Pe{Qi  —  X*)  and  (ii)  follows  from  Cauchy-Schwarz  inequality. 

Since  1+4  (-A)  -  *  ~  1-4  (-A)’  Lemma  3  impli6S: 


A  {I  -  HiVSiA*  AVSi)  e 


1  1  —  5zk  (A)  1  +  S3  k  {A) 

~  1  +  <MA)’  l-S3k(A)  ~ 


< 


283k(A) 


and  thus: 


{I  -  mVSlA* AVsMQ.-  X*)\\p  <  T^|^L||p£(Q!  _x*)||F. 


Furthermore,  according  to  Lemma  4: 


\\TS,  A'APiVeiQ,  -  X*)||F  <  S4k(A)\\V^Ve(Qi  ~  X* 
since  rank  (PsuSiQ)  <  4  k,  VQ  €  TZmXn.  Since  Vsi'Ps{Qi  —  X*)  =  Vx*\(T>i  uxt)X*  where  Vt  :=  Vk  (Vxi  V/(Qi)),  then: 

II ViVsiQi  -  X*)||F  =  ||^*\(^u^)^lF  <  (2 53k(A)  +  2S4k(A))\\Qi  -  X 
using  Lemma  6.  Using  the  above  in  (189),  we  compute: 

^S3k  («4.) 


X(i+  1)  —  X*  < 


F  “  Vi  -Ssk(A) 


+  (2<53fc(«4.)  +  2S4,k(A)) 


253k{A) 

1  —  ^3fc(A) 


A) 

(190) 

F* 

(191) 

(192) 

=  Vk  {'PxX f{Qi)),  then: 

-**14 

(193) 

~x*h 

(194) 

Furthermore: 


||Qi  -  X*\\F  =  ||*(0  +  Ti(X(t)  -  X(i  -  1))||F 

=  ||(1  +  Ti)(X(i)  -  X*)  +  Ti(x*  -  X(t  -  1))||F 

<  (1  +  Tf)||X(*)  —  X*||F  +  Tj||X(i  —  1)  —  X*  || F  (195) 

Combining  (194)  and  (195),  we  get: 

||X((+l)-X-||,<(l+T,)(I^|tgy+(2fa(A)  +  2«14M),I31|tL5)||X(i)-X-||, 

+ +  (2S»<A> + w»  n+4b)  »x(i  - 11  -  x'll,  (•*) 

Let  a  :=  444)  +  (2<53fc(«4.)  +  2S4 k (A))  444)  anc*  ®(®)  ||-^(*  +  1)  —  X*||F.  Then,  (196)  defines  the  following 

homogeneous  recurrence: 


g(i  +  1)  -  «(1  +  Ti)g(i)  +  ang(i  -  1)  <  0 


(197) 


Using  the  method  of  characteristic  roots  to  solve  the  above  recurrence,  we  assume  that  the  homogeneous  linear  recursion  has 
solution  of  the  form  g{i)  =  rl  for  r  £  1Z.  Thus,  replacing  g(i)  =  f1  in  (197)  and  factoring  out  r^~2\  we  form  the  following 
characteristic  polynomial: 

r2  —  a(  1  +  Ti)r  —  an  <  0  (198) 
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Focusing  on  the  worst  case  where  (198)  is  satisfied  with  equality,  we  compute  the  roots  r+2  of  the  quadratic  characteristic 
polynomial  as: 


a(l  +  Ti)  ±  \/A  2/-,  .  \2  .  , 

ri,2  =  - - - ,  where  A  :=  a  (1  +  n)  +  4 an. 

Then,  as  a  general  solution,  we  combine  the  above  roots  with  unknown  coefficients  6i ,  62  to  obtain: 

'a(l  +  n)  +  v/A\i+1  ,  ( a(l  +  Ti)  —  v/A\i+1l 


(199) 


g(i  + 1)  < 


bi 


“)  +&2(“ 


X(0)-X* 


< 


(&i+M( 


a(l  +  Ti)  +  v/Ay+t 


l-XT(O)  -  -XT* 


(200) 


Using  the  initial  condition  g( 0) 
following  recurrence: 


|X(0)-X*| 


X(0)=0 


3  X* 


1,  we  get  61+62  =  1.  Thus,  we  conclude  to  the 


\X(i  +  l)-X* 


\  F  -  \ 


(a(  1  +  Ti)  +  v/Ay+t 


(201) 
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